[00:00:05] now: I've got to run, I'm already running late for my thing!
[00:00:14] Thanks thcipriani|afk
[00:00:20] yw :)
[00:21:11] !log gerrit: rolled back to 2.13.4-13-gc0c5cc4742 from 2.13.8. T152640 rearing its ugly head again (login issues)
[00:21:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:21:21] T152640: Cannot log into Gerrit as of recent upgrade - https://phabricator.wikimedia.org/T152640
[00:27:10] RECOVERY - puppet last run on labsdb1003 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures
[00:34:40] RECOVERY - Check Varnish expiry mailbox lag on cp1072 is OK: OK: expiry mailbox lag is 1890
[00:37:41] RoanKattouw, fixed it. It wasn't us, though: https://gerrit.wikimedia.org/r/#/c/357523/
[00:38:17] (PS1) Chad: gerrit (2.13.8+git1-wmf.5) jessie-wikimedia; urgency=medium [debs/gerrit] - https://gerrit.wikimedia.org/r/357524
[00:39:43] (PS2) Chad: gerrit (2.13.8+git1-wmf.5) jessie-wikimedia; urgency=medium [debs/gerrit] - https://gerrit.wikimedia.org/r/357524 (https://phabricator.wikimedia.org/T158946)
[00:40:19] (CR) Paladox: [C: 1] gerrit (2.13.8+git1-wmf.5) jessie-wikimedia; urgency=medium [debs/gerrit] - https://gerrit.wikimedia.org/r/357524 (https://phabricator.wikimedia.org/T158946) (owner: Chad)
[00:41:06] (CR) Dzahn: [C: 2] gerrit (2.13.8+git1-wmf.5) jessie-wikimedia; urgency=medium [debs/gerrit] - https://gerrit.wikimedia.org/r/357524 (https://phabricator.wikimedia.org/T158946) (owner: Chad)
[01:00:20] Operations, Prometheus-metrics-monitoring: prometheus-node-exporter - invalid group: ‘prometheus:prometheus' - https://phabricator.wikimedia.org/T167245#3321908 (Dzahn)
[01:31:20] PROBLEM - puppet last run on analytics1055 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[01:59:20] RECOVERY - puppet last run on analytics1055 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures
[02:27:51] Operations, Continuous-Integration-Infrastructure: CI for operations/puppet is taking too long - https://phabricator.wikimedia.org/T166888#3321960 (greg) >>! In T166888#3321592, @faidon wrote: > So it doesn't really like sound like the primary use case is jobs like the operations/puppet linting, at least...
[02:29:51] RECOVERY - nova instance creation test on labnet1001 is OK: PROCS OK: 1 process with command name python, args nova-fullstack
[02:30:54] !log l10nupdate@tin scap sync-l10n completed (1.30.0-wmf.2) (duration: 07m 57s)
[02:31:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:32:51] PROBLEM - nova instance creation test on labnet1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name python, args nova-fullstack
[02:44:28] (PS1) Aude: Enable sitelinks on Wikidata for Wiktionary pages outside main namespace (for beta) [mediawiki-config] - https://gerrit.wikimedia.org/r/357532
[02:45:38] (PS2) Aude: Enable sitelinks on Wikidata for Wiktionary pages outside main namespace (for beta) [mediawiki-config] - https://gerrit.wikimedia.org/r/357532 (https://phabricator.wikimedia.org/T158323)
[02:49:32] (PS1) Aude: Enable Wikibase Client on beta wiktionary sites [mediawiki-config] - https://gerrit.wikimedia.org/r/357533 (https://phabricator.wikimedia.org/T158323)
[03:04:46] !log l10nupdate@tin scap sync-l10n completed (1.30.0-wmf.4) (duration: 14m 29s)
[03:04:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:06:40] PROBLEM - Apache HTTP on mw1206 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.076 second response time
[03:07:40] RECOVERY - Apache HTTP on mw1206 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 613 bytes in 0.282 second response time
[03:09:20] PROBLEM - HHVM rendering on mw1198 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.073 second response time
[03:10:20] RECOVERY - HHVM rendering on mw1198 is OK: HTTP OK: HTTP/1.1 200 OK - 74476 bytes in 2.565 second response time
[03:11:40] !log l10nupdate@tin ResourceLoader cache refresh completed at Wed Jun 7 03:11:40 UTC 2017 (duration 6m 54s)
[03:11:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:21:30] PROBLEM - HHVM rendering on mw1197 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.073 second response time
[03:22:30] RECOVERY - HHVM rendering on mw1197 is OK: HTTP OK: HTTP/1.1 200 OK - 74415 bytes in 1.189 second response time
[04:25:11] PROBLEM - HHVM rendering on mw1204 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.076 second response time
[04:26:11] RECOVERY - HHVM rendering on mw1204 is OK: HTTP OK: HTTP/1.1 200 OK - 74384 bytes in 0.605 second response time
[04:30:11] RECOVERY - MariaDB Slave Lag: s1 on db1047 is OK: OK slave_sql_lag Replication lag: 0.42 seconds
[05:08:40] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[05:08:50] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0]
[05:11:40] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[05:12:10] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[05:26:50] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[05:28:10] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[05:28:47] <_joe_> uhm
[05:29:33] it seems that was triggered by scap for wikidata deploy?
[05:29:40] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[05:29:51] RECOVERY - nova instance creation test on labnet1001 is OK: PROCS OK: 1 process with command name python, args nova-fullstack
[05:30:08] (PS1) Marostegui: db-eqiad.php: Repool db1053, depool db1056 [mediawiki-config] - https://gerrit.wikimedia.org/r/357550 (https://phabricator.wikimedia.org/T166206)
[05:30:40] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[05:31:08] <_joe_> yeah again
[05:31:17] <_joe_> mutante: I think that's a red herring
[05:31:24] ok
[05:31:35] <_joe_> also, I don't see the scap sync
[05:31:50] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[05:31:56] (CR) Marostegui: [C: 2] db-eqiad.php: Repool db1053, depool db1056 [mediawiki-config] - https://gerrit.wikimedia.org/r/357550 (https://phabricator.wikimedia.org/T166206) (owner: Marostegui)
[05:31:59] <_joe_> mutante: I think it's varnish, but since it's been going on for a fat 4 hours now, I'll first have breakfast
[05:32:06] yea, that was already completed over 2 hours ago. 20:08 < logmsgbot> !log l10nupdate@tin scap sync-l10n completed (1.30.0-wmf.4) (duration: 14m 29s)
[05:32:32] <_joe_> it's 3 hours sorry
[05:32:43] <_joe_> since 2.40 UTC we have those peaks
[05:32:51] PROBLEM - nova instance creation test on labnet1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name python, args nova-fullstack
[05:33:10] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[05:33:32] (Merged) jenkins-bot: db-eqiad.php: Repool db1053, depool db1056 [mediawiki-config] - https://gerrit.wikimedia.org/r/357550 (https://phabricator.wikimedia.org/T166206) (owner: Marostegui)
[05:33:45] (CR) jenkins-bot: db-eqiad.php: Repool db1053, depool db1056 [mediawiki-config] - https://gerrit.wikimedia.org/r/357550 (https://phabricator.wikimedia.org/T166206) (owner: Marostegui)
[05:34:40] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [1000.0]
[05:35:06] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1053, depool db1056 - T166206 (duration: 01m 03s)
[05:35:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:35:17] T166206: Convert unique keys into primary keys for some wiki tables on s4 - https://phabricator.wikimedia.org/T166206
[05:35:40] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0]
[05:36:40] PROBLEM - Codfw HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0]
[05:39:59] <_joe_> it seems the cause is not varnish
[05:40:20] <_joe_> marostegui: can you hold on with your changes for now?
[05:41:06] yes
[05:41:13] I am actually checking it was not my change
[05:41:30] because it went straight after my deploy
[05:41:42] Ah no, it started before
[05:41:59] (my irssi was a bit crazy)
[05:42:02] https://grafana.wikimedia.org/dashboard/db/varnish-http-errors-datacenters?orgId=1 says otherwise and that it's all green
[05:42:04] But yes, I am not doing anything else
[05:42:12] but i was about to say that earlier too and then it wasn't again
[05:42:35] <_joe_> yeah there is a recurring issue, clearly
[05:44:41] RECOVERY - Codfw HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[06:03:40] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[06:04:10] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[06:04:50] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[06:05:40] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[06:09:20] PROBLEM - mailman I/O stats on fermium is CRITICAL: CRITICAL - I/O stats: Transfers/Sec=345.40 Read Requests/Sec=3420.20 Write Requests/Sec=11.70 KBytes Read/Sec=26165.60 KBytes_Written/Sec=3558.80
[06:16:20] RECOVERY - mailman I/O stats on fermium is OK: OK - I/O stats: Transfers/Sec=1.70 Read Requests/Sec=3.90 Write Requests/Sec=76.80 KBytes Read/Sec=16.40 KBytes_Written/Sec=420.40
[06:28:31] I checked https://logstash.wikimedia.org/app/kibana#/dashboard/Varnish-Webrequest-50X and the last peak of 503s seemed to be related to ints for some eqiad cp hosts, like cp1055
[06:29:02] maybe this is the recurring varnish backend issue
[06:30:00] <_joe_> elukey: no.
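Editor's note: the 5xx alerts above report figures such as 10.00%, 11.11%, 22.22% and 33.33% "of data above the critical threshold". A minimal sketch of that arithmetic follows, assuming the check simply counts how many datapoints in a window exceed the threshold. The function name `percent_above` and the window sizes of 9 and 10 datapoints are assumptions inferred from the percentages; the log does not show how the real check is implemented.

```python
def percent_above(datapoints, threshold):
    """Return the percentage of non-null datapoints strictly above threshold.

    Hypothetical helper for illustration only; None entries stand in for
    missing samples and are ignored.
    """
    values = [v for v in datapoints if v is not None]
    if not values:
        return 0.0
    above = sum(1 for v in values if v > threshold)
    return 100.0 * above / len(values)


# One spike in an assumed nine-sample window gives the 11.11% seen in the alerts.
window = [1200.0] + [100.0] * 8
print(f"{percent_above(window, 1000.0):.2f}% of data above the critical threshold")
```

Under that assumption, a single spike in a short window is enough to trip the alert, which would fit the short-lived peaks discussed in the channel.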
[06:30:06] mmm now that I am checking I am seeing more, nevermind
[06:30:12] <_joe_> coordinate with others
[06:30:14] <_joe_> ;)
[06:30:20] <_joe_> we figured it out in the meantime
[06:30:40] ah didn't check sec, thanks
[06:44:55] (PS1) Marostegui: db-eqiad.php: Repool db1056 [mediawiki-config] - https://gerrit.wikimedia.org/r/357552
[06:46:25] (CR) Marostegui: [C: 2] db-eqiad.php: Repool db1056 [mediawiki-config] - https://gerrit.wikimedia.org/r/357552 (owner: Marostegui)
[06:47:57] (Merged) jenkins-bot: db-eqiad.php: Repool db1056 [mediawiki-config] - https://gerrit.wikimedia.org/r/357552 (owner: Marostegui)
[06:48:06] (CR) jenkins-bot: db-eqiad.php: Repool db1056 [mediawiki-config] - https://gerrit.wikimedia.org/r/357552 (owner: Marostegui)
[06:49:12] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1056 - T166206 (duration: 00m 44s)
[06:49:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:49:23] T166206: Convert unique keys into primary keys for some wiki tables on s4 - https://phabricator.wikimedia.org/T166206
[06:58:10] (PS2) Elukey: Delete unused role/common/analytics/hadoop configs [puppet] - https://gerrit.wikimedia.org/r/357418
[07:00:07] (CR) Elukey: [C: 2] Delete unused role/common/analytics/hadoop configs [puppet] - https://gerrit.wikimedia.org/r/357418 (owner: Elukey)
[07:02:01] mutante: if you are around, https://gerrit.wikimedia.org/r/#/c/356236 was not merged on puppet master
[07:02:31] seems easy enough, it only touches tox.ini
[07:04:22] +1 by hashar, +2 by Daniel seems good enough, merging
[07:05:37] mutante: merged!
[07:07:51] just tried tox -e pep8 ./modules/varnish/files/varnishapi.py, works fine on my laptop
[07:08:54] (flake8 dep is downloaded correctly, everything passes)
[07:11:07] Operations, HHVM: Upload hhvm to stretch apt repo in apt.wikimedia.org - https://phabricator.wikimedia.org/T167225#3322240 (MoritzMuehlenhoff) Open>declined We won't provide HHVM packages for stretch before we start the stretch migration of the production app servers since building HHVM and the e...
[07:20:59] (PS1) Ema: VCL: basic support to return HTTP 420, apply it to UA:wikiScrape/0.0.0 [puppet] - https://gerrit.wikimedia.org/r/357557
[07:22:16] !log Deploy alter table on db1047 enwiki.revision - T162807
[07:22:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:22:26] T162807: Run pt-table-checksum on s1 (enwiki) - https://phabricator.wikimedia.org/T162807
[07:22:40] marostegui: do1047 is in your nightmares nowadays :D
[07:22:45] *db
[07:22:49] I know….
[07:22:59] thanks for working on it!
[07:23:04] Can't wait to decommission that host!
[07:23:15] let me know if I can help! Maybe there is a chance to learn something
[07:23:32] haha nah, it is just a long running alter table
[07:24:58] I promise that I will not stop mysql this time before checking alter tables :D
[07:25:33] marostegui: about db104[67], do you think that we'll be able to order the hw next quarter ?
[07:25:50] elukey: I would hope so, yes
[07:26:05] super, let me know we need to do anything
[07:26:18] elukey: Will do, not sure under which budget that goes
[07:26:49] good question
[07:32:02] elukey: sorry for that and yes, you did right. thank you! /me out again
[07:32:43] Operations, netops: Faulty link between cr2-codfw and cr1-eqdfw - https://phabricator.wikimedia.org/T167261#3322265 (ayounsi)
[07:37:21] (PS1) Muehlenhoff: Use new repository layout for stretch onwards [puppet] - https://gerrit.wikimedia.org/r/357559 (https://phabricator.wikimedia.org/T158583)
[07:53:28] (CR) Muehlenhoff: "Patch for apt source via https://gerrit.wikimedia.org/r/#/c/357559/" [puppet] - https://gerrit.wikimedia.org/r/357422 (https://phabricator.wikimedia.org/T162609) (owner: Filippo Giunchedi)
[08:02:19] !log Run redact_sanitarium on db1095 for dewiki - T153743
[08:02:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:02:27] T153743: Add and sanitize s2, s4, s5, s6 and s7 to sanitarium2 and new labsdb hosts - https://phabricator.wikimedia.org/T153743
[08:02:50] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 212, down: 1, dormant: 0, excluded: 0, unused: 0 xe-3/2/3: down - Core: cr2-codfw:xe-5/0/1 (Zayo, OGYX/120003//ZYO) 36ms {#2909} [10Gbps wave]
[08:06:29] (PS3) Filippo Giunchedi: aptrepo: add hp-mcp-stretch and thirdparty/hwraid [puppet] - https://gerrit.wikimedia.org/r/357422 (https://phabricator.wikimedia.org/T162609)
[08:10:13] (PS1) Muehlenhoff: Configure fixed lock manager ports for labstore NFS servers [puppet] - https://gerrit.wikimedia.org/r/357562 (https://phabricator.wikimedia.org/T165136)
[08:12:08] (CR) Gehel: Add Shiny Server module and Discovery Dashboards role/profile (3 comments) [puppet] - https://gerrit.wikimedia.org/r/353571 (https://phabricator.wikimedia.org/T161354) (owner: Bearloga)
[08:12:19] (CR) Filippo Giunchedi: aptrepo: add hp-mcp-stretch and thirdparty/hwraid (2 comments) [puppet] - https://gerrit.wikimedia.org/r/357422 (https://phabricator.wikimedia.org/T162609) (owner: Filippo Giunchedi)
[08:16:42] (CR) Gehel: kibana: allow any arbitrary setting (1 comment) [puppet] - https://gerrit.wikimedia.org/r/356900 (owner: Hashar)
[08:22:44] Operations, ops-eqiad: Degraded RAID on ms-be1016 - https://phabricator.wikimedia.org/T167264#3322381 (ops-monitoring-bot)
[08:23:22] (CR) Hashar: "I felt that adding settings each time we need one would be cumbersome. The only reason I made this totally open was to add "elasticsearch" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/356900 (owner: Hashar)
[08:27:30] (PS1) Marostegui: db-eqiad.php: Depool db1056 [mediawiki-config] - https://gerrit.wikimedia.org/r/357564 (https://phabricator.wikimedia.org/T166206)
[08:28:53] Operations, ops-eqiad: Degraded RAID on ms-be1016 - https://phabricator.wikimedia.org/T167264#3322398 (fgiunchedi)
[08:29:05] Operations, ops-eqiad: Degraded RAID on ms-be1016 - https://phabricator.wikimedia.org/T167264#3322381 (fgiunchedi) a:Cmjohnson
[08:32:21] (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1056 [mediawiki-config] - https://gerrit.wikimedia.org/r/357564 (https://phabricator.wikimedia.org/T166206) (owner: Marostegui)
[08:33:36] (Merged) jenkins-bot: db-eqiad.php: Depool db1056 [mediawiki-config] - https://gerrit.wikimedia.org/r/357564 (https://phabricator.wikimedia.org/T166206) (owner: Marostegui)
[08:33:49] (CR) jenkins-bot: db-eqiad.php: Depool db1056 [mediawiki-config] - https://gerrit.wikimedia.org/r/357564 (https://phabricator.wikimedia.org/T166206) (owner: Marostegui)
[08:34:35] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1056 - T166206 (duration: 00m 43s)
[08:34:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:34:44] T166206: Convert unique keys into primary keys for some wiki tables on s4 - https://phabricator.wikimedia.org/T166206
[08:34:51] (CR) Muehlenhoff: [C: 1] aptrepo: add hp-mcp-stretch and thirdparty/hwraid [puppet] - https://gerrit.wikimedia.org/r/357422 (https://phabricator.wikimedia.org/T162609) (owner: Filippo Giunchedi)
[08:37:01] Operations, Wikimedia-Logstash, Discovery-Search (Current work): upgrade kibana to v5.3.3 - https://phabricator.wikimedia.org/T167266#3322434 (Gehel)
[08:37:43] Operations, Prometheus-metrics-monitoring: prometheus-node-exporter - invalid group: ‘prometheus:prometheus' - https://phabricator.wikimedia.org/T167245#3321908 (fgiunchedi) Indeed, the problem there I think is that `prometheus` user exists in labs but not the group, was node-exporter working otherwise?
[08:38:13] Operations, Monitoring: internal IPMI error - https://phabricator.wikimedia.org/T167121#3322464 (ema)
[08:38:52] Operations, Monitoring: internal IPMI error - https://phabricator.wikimedia.org/T167121#3318229 (ema)
[08:40:20] PROBLEM - HP RAID on ms-be1016 is CRITICAL: CRITICAL: Slot 1: Failed: 1I:1:2 - OK: 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, 2I:4:1, 2I:4:2 - Controller: OK - Battery/Capacitor: OK
[08:40:22] ACKNOWLEDGEMENT - HP RAID on ms-be1016 is CRITICAL: CRITICAL: Slot 1: Failed: 1I:1:2 - OK: 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, 2I:4:1, 2I:4:2 - Controller: OK - Battery/Capacitor: OK nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T167268
[08:40:25] Operations, ops-eqiad: Degraded RAID on ms-be1016 - https://phabricator.wikimedia.org/T167268#3322470 (ops-monitoring-bot)
[08:41:20] PROBLEM - puppet last run on ms-be1016 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[mkfs-/dev/sdh1]
[08:42:21] Operations, ops-eqiad, media-storage: Degraded RAID on ms-be1016 - https://phabricator.wikimedia.org/T167268#3322474 (Volans)
[08:42:32] godog: ^^^ also puppet broken, as expected ;)
[08:42:40] you're breaking too many disks lately :-P
[08:43:15] ask the users to upload/download less, less disks broken :P
[08:43:42] :D
[08:43:56] I need to check later why we got 2 tasks though
[08:44:23] T167268 and T167264 seems the same, but have different output
[08:44:23] T167268: Degraded RAID on ms-be1016 - https://phabricator.wikimedia.org/T167268
[08:44:23] T167264: Degraded RAID on ms-be1016 - https://phabricator.wikimedia.org/T167264
[08:44:24] my fault, the first is manual because I thought the disk was already failed on the controller but it wasn't
[08:44:38] then I marked the disk failed manually on the controller too
[08:46:29] Operations, ops-eqiad, media-storage: Degraded RAID on ms-be1016 - https://phabricator.wikimedia.org/T167268#3322489 (fgiunchedi)
[08:46:31] Operations, ops-eqiad: Degraded RAID on ms-be1016 - https://phabricator.wikimedia.org/T167264#3322487 (fgiunchedi)
[08:46:39] godog: got it :) feel free to close one
[08:46:53] one less thing to check for me, mystery solved ;)
[08:46:57] Operations, ops-eqiad: Degraded RAID on ms-be1016 - https://phabricator.wikimedia.org/T167264#3322381 (fgiunchedi) ``` 09:43 I need to check later why we got 2 tasks though 09:44 my fault, the first is manual because I thought the disk was already failed on the controller but it wasn't...
[08:47:35] yeah, we're still on the mystery of why hw can be so crap sometimes
[08:50:19] !log Deploy alter table s4 - db1056 - T166206
[08:50:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:50:28] T166206: Convert unique keys into primary keys for some wiki tables on s4 - https://phabricator.wikimedia.org/T166206
[08:51:46] (PS1) Gehel: Upgrade kibana to v5.3.3 [puppet] - https://gerrit.wikimedia.org/r/357565 (https://phabricator.wikimedia.org/T167266)
[08:54:20] (PS1) Marostegui: db-eqiad.php: Depool db1076 [mediawiki-config] - https://gerrit.wikimedia.org/r/357566 (https://phabricator.wikimedia.org/T166205)
[08:54:39] (PS1) Elukey: Bump zookeeper client version to 3.4.5+dfsg-2+deb8u2 [puppet] - https://gerrit.wikimedia.org/r/357567
[08:54:55] (Abandoned) Elukey: Bump zookeeper client version to 3.4.5+dfsg-2+deb8u2 [puppet] - https://gerrit.wikimedia.org/r/357567 (owner: Elukey)
[08:55:05] not needed, pebkac
[08:56:32] (PS1) Filippo Giunchedi: swift: mask object reconstructor on >= jessie [puppet] - https://gerrit.wikimedia.org/r/357568 (https://phabricator.wikimedia.org/T162609)
[08:56:39] (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1076 [mediawiki-config] - https://gerrit.wikimedia.org/r/357566 (https://phabricator.wikimedia.org/T166205) (owner: Marostegui)
[08:57:36] (Merged) jenkins-bot: db-eqiad.php: Depool db1076 [mediawiki-config] - https://gerrit.wikimedia.org/r/357566 (https://phabricator.wikimedia.org/T166205) (owner: Marostegui)
[08:57:45] (CR) jenkins-bot: db-eqiad.php: Depool db1076 [mediawiki-config] - https://gerrit.wikimedia.org/r/357566 (https://phabricator.wikimedia.org/T166205) (owner: Marostegui)
[08:58:34] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1076 - T166205 (duration: 00m 43s)
[08:58:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:58:43] T166205: Convert unique keys into primary keys for some wiki tables on s2 - https://phabricator.wikimedia.org/T166205
[08:58:52] !log Deploy alter table on s2 - db1076 - T166205
[08:59:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:03:05] (CR) Filippo Giunchedi: [C: 2] swift: mask object reconstructor on >= jessie [puppet] - https://gerrit.wikimedia.org/r/357568 (https://phabricator.wikimedia.org/T162609) (owner: Filippo Giunchedi)
[09:08:37] Operations: Integrate jessie 8.8 point release - https://phabricator.wikimedia.org/T164703#3322545 (ema) p:Triage>Normal
[09:08:55] (PS1) Filippo Giunchedi: install_server: move ms-be2* trusty hosts to stretch [puppet] - https://gerrit.wikimedia.org/r/357569 (https://phabricator.wikimedia.org/T162609)
[09:11:44] (CR) Muehlenhoff: [C: 1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/357565 (https://phabricator.wikimedia.org/T167266) (owner: Gehel)
[09:12:44] (PS2) Gehel: Upgrade kibana to v5.3.3 [puppet] - https://gerrit.wikimedia.org/r/357565 (https://phabricator.wikimedia.org/T167266)
[09:15:11] !log upgrade lvs4001-4004 to jessie 8.8 point release T164703
[09:15:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:15:20] T164703: Integrate jessie 8.8 point release - https://phabricator.wikimedia.org/T164703
[09:16:09] (CR) Gehel: [C: 2] Upgrade kibana to v5.3.3 [puppet] - https://gerrit.wikimedia.org/r/357565 (https://phabricator.wikimedia.org/T167266) (owner: Gehel)
[09:16:47] (CR) Filippo Giunchedi: [C: 2] install_server: move ms-be2* trusty hosts to stretch [puppet] - https://gerrit.wikimedia.org/r/357569 (https://phabricator.wikimedia.org/T162609) (owner: Filippo Giunchedi)
[09:16:47] (PS2) Filippo Giunchedi: install_server: move ms-be2* trusty hosts to stretch [puppet] - https://gerrit.wikimedia.org/r/357569 (https://phabricator.wikimedia.org/T162609)
[09:22:21] Operations, Wikimedia-Logstash, Discovery-Search (Current work), Patch-For-Review: upgrade kibana to v5.3.3 - https://phabricator.wikimedia.org/T167266#3322556 (Gehel) kibana 5.3.3 is now uploaded to reprepro
[09:28:58] !log upgrading kibana to v5.3.3 on logstash cluster - T167266
[09:29:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:29:08] T167266: upgrade kibana to v5.3.3 - https://phabricator.wikimedia.org/T167266
[09:32:12] gehel: FYI there's cronspam from logstash, /bin/sh: 1: /usr/local/bin/logstash_delete_index: not found
[09:32:27] damn, still there?
[09:32:48] yeah logstash1002 sent it this morning too
[09:32:57] only that machine though
[09:34:01] strange, that's the old cron that has been removed last week. I can't see it with crontab -l ...
[09:34:46] it'd be in /var/spool/cron/crontabs then
[09:35:32] nope, not there either
[09:38:18] Operations, Wikimedia-Logstash, Discovery-Search (Current work), Patch-For-Review: upgrade kibana to v5.3.3 - https://phabricator.wikimedia.org/T167266#3322593 (Gehel) Kibana is now upgraded on the production logstash cluster and in deployment-logstash2
[09:41:38] odd, let me know if you find it!
[09:42:01] godog: I'm grepping left and right, but no, I don't find it...
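Editor's note: the hunt above for the stale `logstash_delete_index` entry covers `crontab -l` and `/var/spool/cron/crontabs`. A minimal sketch of a broader search follows, assuming the usual Debian cron locations. The helper name `find_cron_references` and the default path list are illustrative assumptions, not something used in the channel.

```python
import os

# Assumed standard Debian cron locations; adjust for the host in question.
DEFAULT_CRON_PATHS = ["/etc/crontab", "/etc/cron.d", "/var/spool/cron/crontabs"]


def find_cron_references(needle, paths=DEFAULT_CRON_PATHS):
    """Return (file, line number, line) for every line that contains needle."""
    hits = []
    for path in paths:
        if os.path.isfile(path):
            candidates = [path]
        elif os.path.isdir(path):
            candidates = [
                os.path.join(root, name)
                for root, _dirs, files in os.walk(path)
                for name in files
            ]
        else:
            continue  # path does not exist on this host
        for candidate in sorted(candidates):
            try:
                with open(candidate, errors="replace") as handle:
                    for number, line in enumerate(handle, start=1):
                        if needle in line:
                            hits.append((candidate, number, line.rstrip("\n")))
            except OSError:
                continue  # unreadable without root; skip rather than fail
    return hits


if __name__ == "__main__":
    for found in find_cron_references("logstash_delete_index"):
        print(found)
```

If a search like this comes back empty, as the grep did here, the job may only survive in the running cron daemon's in-memory copy, which would fit the "service cron restart" suggestion that follows.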
[09:42:55] Operations, Wikimedia-Logstash, Discovery-Search (Current work), Patch-For-Review: upgrade kibana to v5.3.3 - https://phabricator.wikimedia.org/T167266#3322617 (Gehel) Note: this upgrade is patching the issues : https://discuss.elastic.co/t/elastic-stack-5-4-1-and-5-3-3-security-updates/87952
[09:43:14] meh the only other thing I can think of is the olde "service cron restart"
[09:44:06] Operations, Wikimedia-Logstash, Discovery-Search (Current work), Patch-For-Review: upgrade kibana to v5.3.3 - https://phabricator.wikimedia.org/T167266#3322434 (MoritzMuehlenhoff) Which is CVE-2017-8440 (just to have the ID indexed when searching for it in Phab)
[09:45:53] godog: I don't see the email though
[09:46:20] ahhh got into spam
[09:46:57] technically correct!
[09:47:32] godog: https://giphy.com/gifs/the-simpsons-trash-garbage-5xaOcLCBzBw4QrtdDP2
[09:48:05] gehel: in which file was logstash_delete_index referenced?
[09:48:07] haahh yes I was rummaging in spam
[09:48:59] volans: the cron was setup by puppet, so most probably in /var/spool/crontabs
[09:49:03] !log upgrade lvs[3001-3004] to jessie 8.8 point release T164703
[09:49:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:49:11] T164703: Integrate jessie 8.8 point release - https://phabricator.wikimedia.org/T164703
[09:49:13] I can see the log where puppet removed that cron
[09:49:47] before or after the email?
[09:49:55] well before
[09:50:22] I think I'll try godog advice and check tomorrow ...
[09:50:37] lunch, back later
[09:55:41] RECOVERY - IPMI Temperature on ms-be2014 is OK: Sensor Type(s) Temperature Status: OK
[09:56:43] meh, I just rebooted it for unrelated reasons
[09:59:52] volans: good morning. I hope I haven't been too pedantic on the flake8_no_extension review ( https://gerrit.wikimedia.org/r/#/c/357197 )
[10:00:08] (PS1) Filippo Giunchedi: install_server: s/ubuntu/debian/ for stretch ms-be2* machines [puppet] - https://gerrit.wikimedia.org/r/357576
[10:00:16] godog: interesting, the reboot fixed the IPMI temperature check?
[10:01:06] hashar: no, have you seen the latest patchset ;)
[10:01:09] ema: seems like it
[10:01:20] (PS2) Filippo Giunchedi: install_server: s/ubuntu/debian/ for stretch ms-be2* machines [puppet] - https://gerrit.wikimedia.org/r/357576
[10:01:35] volans: yeah gotta review it :-}
[10:02:12] volans: the py.erb are all bad design patterns really ... then there are only 9 such files so it is not too much trouble
[10:02:28] (CR) Filippo Giunchedi: [V: 2 C: 2] install_server: s/ubuntu/debian/ for stretch ms-be2* machines [puppet] - https://gerrit.wikimedia.org/r/357576 (owner: Filippo Giunchedi)
[10:02:29] should be much simpler now, then it could be modified later to be reused for different checks (rake, shellcheck, etc)
[10:02:53] Operations, Monitoring: internal IPMI error - https://phabricator.wikimedia.org/T167121#3322672 (ema) @fgiunchedi just rebooted ms-be2014 for unrelated reasons and the reboot alone fixed the issue. Perhaps IPMI can end up in some weird state and that gets fixed upon reboot?
[10:02:55] hashar: yeah, that's why I'd like it to show them so that we can fix them ;) 1 or 2 are .erb without .py.erb also
[10:08:12] Operations, Goal, Kubernetes, Patch-For-Review, and 3 others: Prepare and maintain base container images - https://phabricator.wikimedia.org/T162042#3322681 (Joe)
[10:10:20] PROBLEM - dhclient process on thumbor1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[10:10:21] PROBLEM - salt-minion processes on thumbor1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[10:12:11] RECOVERY - dhclient process on thumbor1001 is OK: PROCS OK: 0 processes with command name dhclient
[10:12:11] RECOVERY - salt-minion processes on thumbor1001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[10:17:35] Operations, Goal, Kubernetes: Define a process to keep images up-to-date on similar standards as the rest of production - https://phabricator.wikimedia.org/T162043#3322706 (Joe)
[10:17:55] Operations, Kubernetes, Prod-Kubernetes (Experiment), User-Joe: Make security updates of docker images manageable - https://phabricator.wikimedia.org/T167269#3322625 (Joe)
[10:26:09] !log upgrade lvs1*/lvs2* to jessie 8.8 point release T164703
[10:26:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:26:18] T164703: Integrate jessie 8.8 point release - https://phabricator.wikimedia.org/T164703
[10:29:14] !log installing tiff regression security update on trusty
[10:29:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:30:31] PROBLEM - puppet last run on lvs2006 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[10:33:30] RECOVERY - puppet last run on lvs2006 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures
[10:33:39] the lvs2006 puppetfail was me ^
[10:34:41] (CR) Hashar: [C: 1] "Thank you for the follow up. On my machine the script takes roughly 5 seconds that will be added to every build of the job, but i think th" (5 comments) [puppet] - https://gerrit.wikimedia.org/r/357197 (https://phabricator.wikimedia.org/T144169) (owner: Volans)
[10:35:05] volans: +1 from me. And maybe we can fix up the few shebangs that prevent libmagic detection
[10:35:36] thanks for the review
[10:37:12] volans: twist, if you pass --statistics to flake8, it shows a breakdown count of each error :}
[10:37:30] most seem to be whitespace related so should be easy to fix up
[10:47:42] Operations, netops: codfw row D switch upgrade - https://phabricator.wikimedia.org/T167274#3322805 (ayounsi)
[10:54:31] PROBLEM - puppet last run on sca1003 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[scap]
[11:01:02] PROBLEM - DPKG on lvs1002 is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[11:01:55] me again ^
[11:02:02] RECOVERY - DPKG on lvs1002 is OK: All packages OK
[11:03:01] PROBLEM - puppet last run on ms-be2007 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[prometheus-node-exporter]
[11:03:25] that's me instead ^
[11:03:41] PROBLEM - Check Varnish expiry mailbox lag on cp1072 is CRITICAL: CRITICAL: expiry mailbox lag is 2017994
[11:04:01] PROBLEM - DPKG on lvs1001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[11:04:31] Operations, DBA, Wikimedia-Site-requests: Rename user "Mlpearc" to "FlightTime" on Central Auth - https://phabricator.wikimedia.org/T166028#3322892 (MarcoAurelio)
[11:05:01] RECOVERY - DPKG on lvs1001 is OK: All packages OK
[11:06:05] Operations, DBA, Wikimedia-Site-requests: Rename user "Mlpearc" to "FlightTime" on Central Auth - https://phabricator.wikimedia.org/T166028#3283276 (MarcoAurelio) @jcrespo and @Marostegui for your opinion.
[11:07:11] PROBLEM - DPKG on lvs1003 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [11:08:44] !log restarting cron on logstash cluster [11:08:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:10:11] RECOVERY - DPKG on lvs1003 is OK: All packages OK [11:17:20] (03PS1) 10Ayounsi: LibreNMS: enable 2FA [puppet] - 10https://gerrit.wikimedia.org/r/357585 (https://phabricator.wikimedia.org/T164911) [11:19:09] (03CR) 10Ayounsi: [C: 032] LibreNMS: enable 2FA [puppet] - 10https://gerrit.wikimedia.org/r/357585 (https://phabricator.wikimedia.org/T164911) (owner: 10Ayounsi) [11:22:41] RECOVERY - puppet last run on sca1003 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures [11:25:28] 10Operations, 10Monitoring: internal IPMI error - https://phabricator.wikimedia.org/T167121#3322958 (10ema) According to [[http://www.gnu.org/software/freeipmi/freeipmi-faq.html#Why-am-I-seeing-so-many-_0027internal-IPMI-error_0027-or-_0027driver-busy_0027-messages_003f | the Freeipmi FAQs ]], /dev/ipmi0 shoul... [11:26:49] 10Operations, 10netops, 10Patch-For-Review: LibreNMS improvements - https://phabricator.wikimedia.org/T164911#3322959 (10ayounsi) [11:30:01] RECOVERY - nova instance creation test on labnet1001 is OK: PROCS OK: 1 process with command name python, args nova-fullstack [11:30:11] RECOVERY - puppet last run on ms-be2007 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures [11:30:12] RECOVERY - IPMI Temperature on ocg1002 is OK: Sensor Type(s) Temperature Status: OK [11:31:22] 10Operations, 10Continuous-Integration-Infrastructure: CI for operations/puppet is taking too long - https://phabricator.wikimedia.org/T166888#3322963 (10faidon) Great, thanks :) I'm looking at the output of a Jenkins job and it looks like it takes about a minute to execute, so I guess we have two semi-related... 
[11:33:01] PROBLEM - nova instance creation test on labnet1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name python, args nova-fullstack [11:41:15] (03CR) 10Faidon Liambotis: [C: 04-1] "5 seconds on every job is quite a bit -- and on the CI instances with empty pagecaches and in VMs in a shared infrastructure without SSDs " [puppet] - 10https://gerrit.wikimedia.org/r/357197 (https://phabricator.wikimedia.org/T144169) (owner: 10Volans) [11:42:40] 10Operations, 10netops, 10Patch-For-Review: LibreNMS improvements - https://phabricator.wikimedia.org/T164911#3322995 (10ayounsi) [11:48:29] PROBLEM - puppet last run on cp3038 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:49:15] (03CR) 10Faidon Liambotis: [C: 04-1] "Minor syntax change inside. Moreover, d-i + labs_bootstrapvz + docker (+ package_builder) all need the same change as well. For them third" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/357559 (https://phabricator.wikimedia.org/T158583) (owner: 10Muehlenhoff) [11:54:30] (03CR) 10Faidon Liambotis: [C: 04-1] "This would work, but IMHO it's ugly because:" [puppet] - 10https://gerrit.wikimedia.org/r/357422 (https://phabricator.wikimedia.org/T162609) (owner: 10Filippo Giunchedi) [11:58:52] 10Operations, 10Office-IT, 10netops: Some BGP sessions to the SF Office down - https://phabricator.wikimedia.org/T167281#3323004 (10ayounsi) [12:00:04] aude: Respected human, time to deploy Wikidata (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170607T1200). Please do the needful. 
[12:03:38] 10Operations, 10TCB-Team, 10Two-Column-Edit-Conflict-Merge, 10Patch-For-Review, 10User-Addshore: Deploy TwoColConflict extension to beta - https://phabricator.wikimedia.org/T154927#3323146 (10Tobi_WMDE_SW) [12:04:14] 10Operations, 10LDAP-Access-Requests, 10TCB-Team, 10Wikidata, 10Release-Engineering-Team (Kanban): Add Andrew and Aleksey to ldap/wmde group - https://phabricator.wikimedia.org/T152088#3323163 (10Tobi_WMDE_SW) [12:04:25] * aude waves [12:04:29] 10Operations, 10Electron-PDFs, 10TCB-Team, 10Patch-For-Review, 10User-Addshore: Deploy ElectronPdfService Extension to beta cluster - https://phabricator.wikimedia.org/T150945#3323176 (10Tobi_WMDE_SW) [12:04:31] 10Operations, 10Electron-PDFs, 10TCB-Team, 10Patch-For-Review, 10User-Addshore: Deploy ElectronPdfService Extension to dewiki - https://phabricator.wikimedia.org/T150942#3323179 (10Tobi_WMDE_SW) [12:05:04] 10Operations, 10TCB-Team, 10Two-Column-Edit-Conflict-Merge, 10Patch-For-Review, 10User-Addshore: Deploy TwoColConflict extension to production - https://phabricator.wikimedia.org/T150184#3323192 (10Tobi_WMDE_SW) [12:06:41] 10Operations, 10Discovery, 10Wikidata, 10Wikidata-Query-Service, and 3 others: publish lag and response time for wdqs codfw to graphite - https://phabricator.wikimedia.org/T146207#3323239 (10Tobi_WMDE_SW) [12:11:39] (03CR) 10Aude: [C: 032] Enable sitelinks on Wikidata for Wiktionary pages outside main namespace (for beta) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/357532 (https://phabricator.wikimedia.org/T158323) (owner: 10Aude) [12:12:40] (03Merged) 10jenkins-bot: Enable sitelinks on Wikidata for Wiktionary pages outside main namespace (for beta) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/357532 (https://phabricator.wikimedia.org/T158323) (owner: 10Aude) [12:12:53] (03CR) 10jenkins-bot: Enable sitelinks on Wikidata for Wiktionary pages outside main namespace (for beta) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/357532 
(https://phabricator.wikimedia.org/T158323) (owner: 10Aude) [12:14:20] 10Operations, 10Patch-For-Review: Race condition in setting net.netfilter.nf_conntrack_tcp_timeout_time_wait - https://phabricator.wikimedia.org/T136094#3323294 (10MoritzMuehlenhoff) This has also independantly been reported at Debian: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=864341 [12:15:29] PROBLEM - MariaDB Slave IO: x1 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:15:29] PROBLEM - MariaDB Slave SQL: m2 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:15:29] PROBLEM - MariaDB Slave IO: s3 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:15:29] PROBLEM - MariaDB Slave IO: s6 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:15:29] PROBLEM - MariaDB Slave SQL: s6 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:15:30] PROBLEM - MariaDB Slave IO: s5 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:15:30] PROBLEM - MariaDB Slave SQL: x1 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:15:31] PROBLEM - MariaDB Slave SQL: s3 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:15:31] PROBLEM - MariaDB Slave IO: s7 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:15:32] PROBLEM - MariaDB Slave IO: m2 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:15:32] PROBLEM - MariaDB Slave SQL: s1 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:15:33] PROBLEM - MariaDB Slave SQL: m3 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:15:33] PROBLEM - MariaDB Slave SQL: s5 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:15:34] PROBLEM - MariaDB Slave SQL: s7 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[12:16:46] testing my change on mwdebug1002 [12:17:21] /away [12:17:26] arggg [12:17:29] RECOVERY - MariaDB Slave SQL: m3 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes [12:17:30] RECOVERY - MariaDB Slave IO: s7 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [12:17:30] RECOVERY - MariaDB Slave SQL: s1 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes [12:17:30] RECOVERY - MariaDB Slave IO: m2 on dbstore1001 is OK: OK slave_io_state not a slave [12:17:30] RECOVERY - MariaDB Slave SQL: s6 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional) [12:17:30] RECOVERY - MariaDB Slave IO: s6 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [12:17:30] RECOVERY - MariaDB Slave SQL: s3 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes [12:17:31] RECOVERY - MariaDB Slave SQL: s5 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes [12:17:31] RECOVERY - MariaDB Slave IO: s3 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [12:17:32] RECOVERY - MariaDB Slave SQL: x1 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional) [12:17:32] RECOVERY - MariaDB Slave SQL: s7 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional) [12:17:33] RECOVERY - MariaDB Slave IO: m3 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [12:17:33] RECOVERY - MariaDB Slave IO: s4 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [12:17:34] RECOVERY - MariaDB Slave SQL: s4 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes [12:18:19] RECOVERY - MariaDB Slave IO: x1 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [12:18:29] RECOVERY - puppet last run on cp3038 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures [12:19:58] !log aude@tin Synchronized wmf-config/Wikibase-labs.php: Site links for non-main namespace wiktionary pages (duration: 
00m 44s) [12:20:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:25:15] jouncebot: refresh [12:25:17] I refreshed my knowledge about deployments. [12:25:18] jouncebot: next [12:25:18] In 0 hour(s) and 34 minute(s): European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170607T1300) [12:26:04] !log aude@tin Synchronized wmf-config/Wikibase.php: Site links for non-main namespace wiktionary pages T158323 (duration: 00m 43s) [12:26:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:26:13] T158323: enable sitelinks on Wikidata for Wiktionary pages outside main namespace (phase 1) - https://phabricator.wikimedia.org/T158323 [12:26:22] (03PS1) 10Elukey: Test new zookeeper version on conf2002 [puppet] - 10https://gerrit.wikimedia.org/r/357590 [12:26:42] moritzm: --^ [12:26:44] 10Operations, 10DBA, 10Wikimedia-Site-requests: Rename user "Mlpearc" to "FlightTime" on Central Auth - https://phabricator.wikimedia.org/T166028#3323325 (10Marostegui) @MarcoAurelio thanks for the heads up!. We will make sure we have no schema maintenance scheduled for when you decide to run it. Ping any of... 
[12:27:12] 10Operations, 10Performance-Team, 10Thumbor: Package latest version of Thumbor and deploy it - https://phabricator.wikimedia.org/T167286#3323329 (10Gilles) [12:27:14] 10Operations, 10Performance-Team, 10Thumbor, 10MW-1.30-release-notes (WMF-deploy-2017-06-06_(1.30.0-wmf.4)), 10Patch-For-Review: Thumbor should reject thumbnail requests that are the same size as the original or bigger - https://phabricator.wikimedia.org/T150741#3323342 (10Gilles) [12:27:23] (03CR) 10Aude: [C: 032] Enable Wikibase Client on beta wiktionary sites [mediawiki-config] - 10https://gerrit.wikimedia.org/r/357533 (https://phabricator.wikimedia.org/T158323) (owner: 10Aude) [12:28:07] hashar: i'd like to also deploy a hotfix https://gerrit.wikimedia.org/r/#/c/357530/ (to wmf4) [12:28:15] 10Operations, 10Performance-Team, 10Thumbor: Package latest version of Thumbor and deploy it - https://phabricator.wikimedia.org/T167286#3323357 (10Gilles) [12:28:17] 10Operations, 10Performance-Team, 10Thumbor: Backport python-schedule and add it to jessie-wikimedia - https://phabricator.wikimedia.org/T167287#3323344 (10Gilles) [12:28:22] 10Operations, 10Performance-Team, 10Thumbor: Backport python-schedule and add it to jessie-wikimedia - https://phabricator.wikimedia.org/T167287#3323344 (10Gilles) [12:28:22] maybe since i'm deploying other stuff now, can take care of it before swat [12:28:39] PROBLEM - Check Varnish expiry mailbox lag on cp1063 is CRITICAL: CRITICAL: expiry mailbox lag is 2038123 [12:28:45] aude: I am attending a tech talk right now [12:29:00] aude: but change looks good to me so please be bold!? 
[12:29:41] aude: note that wmf.4 is only on group0 for now [12:30:13] (03CR) 10Muehlenhoff: [C: 031] Test new zookeeper version on conf2002 [puppet] - 10https://gerrit.wikimedia.org/r/357590 (owner: 10Elukey) [12:30:19] yeah, the issue is on wmf4 only [12:30:34] still working on my changes for beta though [12:30:59] The .git directory is missing from extensions/Constraints/, see https://getcomposer.org/commit-deps for more information [12:31:00] :/ [12:31:20] https://gerrit.wikimedia.org/r/#/c/354522/ [12:31:39] we need to switch to composer install (at least for the wikidata build) to respect the lock file if it is there [12:31:54] ah [12:32:22] even then, i'm not 100% sure composer install won't have this specific issue [12:32:29] !log cp1072, cp1063 restarting varnish backend for mailbox lag [12:32:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:32:50] aude: I have noticed your patch to invoke "composer install" whenever a composer.lock is present. Guess I will have to review/test it [12:33:11] i think it works as composer update if there is no composer.lock [12:33:19] whenever you can look at it, that would be great [12:33:41] (03CR) 10Elukey: [C: 032] Test new zookeeper version on conf2002 [puppet] - 10https://gerrit.wikimedia.org/r/357590 (owner: 10Elukey) [12:34:13] (03PS2) 10Aude: Enable Wikibase Client on beta wiktionary sites [mediawiki-config] - 10https://gerrit.wikimedia.org/r/357533 (https://phabricator.wikimedia.org/T158323) [12:34:22] (03CR) 10Aude: [C: 032] Enable Wikibase Client on beta wiktionary sites [mediawiki-config] - 10https://gerrit.wikimedia.org/r/357533 (https://phabricator.wikimedia.org/T158323) (owner: 10Aude) [12:35:45] (03Merged) 10jenkins-bot: Enable Wikibase Client on beta wiktionary sites [mediawiki-config] - 10https://gerrit.wikimedia.org/r/357533 (https://phabricator.wikimedia.org/T158323) (owner: 10Aude) [12:35:53] (03CR) 10jenkins-bot: Enable Wikibase Client on beta wiktionary sites
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/357533 (https://phabricator.wikimedia.org/T158323) (owner: 10Aude) [12:38:39] RECOVERY - Check Varnish expiry mailbox lag on cp1063 is OK: OK: expiry mailbox lag is 0 [12:40:00] !log upgrade zookeeper packages on conf2002 to 3.4.5+dfsg-2+deb8u2 [12:40:06] !log aude@tin Synchronized wmf-config/InitialiseSettings-labs.php: Enable Wikibase Client on beta wiktionary sites T158323 (duration: 00m 43s) [12:40:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:40:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:40:18] T158323: enable sitelinks on Wikidata for Wiktionary pages outside main namespace (phase 1) - https://phabricator.wikimedia.org/T158323 [12:41:27] dcausse: jan_drewniak : I will not be available for SWAT sorry. [12:41:46] well I might, but I will be in some tech talk so it is going to be hard to handle it myself. [12:42:09] hashar: np, I can swat, no one from releng will be around? [12:43:19] !log Drop table updates on s2 - T139342 [12:43:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:43:30] T139342: DROP OAI-related tables - https://phabricator.wikimedia.org/T139342 [12:43:39] RECOVERY - Check Varnish expiry mailbox lag on cp1072 is OK: OK: expiry mailbox lag is 0 [12:46:54] 10Operations, 10DBA, 10Wikimedia-Site-requests: Rename user "Mlpearc" to "FlightTime" on Central Auth - https://phabricator.wikimedia.org/T166028#3323381 (10jcrespo) Same recommendations apply than T167031#3315249 plus the additional of not making both at the same time. 
[12:48:33] (03PS1) 10Ema: base::kernel: create /etc/modules-load.d on Trusty systems [puppet] - 10https://gerrit.wikimedia.org/r/357591 [12:53:16] (03PS1) 10Jcrespo: mariadb: Remove old codfw db hosts from candidates for reimage [puppet] - 10https://gerrit.wikimedia.org/r/357592 [12:55:23] (03CR) 10Jcrespo: [C: 032] mariadb: Remove old codfw db hosts from candidates for reimage [puppet] - 10https://gerrit.wikimedia.org/r/357592 (owner: 10Jcrespo) [12:56:54] !log aude@tin Synchronized php-1.30.0-wmf.4/extensions/Wikidata: Fix parser function registration T167238 (duration: 02m 20s) [12:57:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:57:04] T167238: Fatal Error: "Tag hook for noexternallanglinks is not callable" (1.30.0-wmf.4/includes/parser/Parser.php) - https://phabricator.wikimedia.org/T167238 [12:57:29] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] [12:58:21] (03CR) 10Muehlenhoff: [C: 031] "Looks good." [puppet] - 10https://gerrit.wikimedia.org/r/357591 (owner: 10Ema) [12:58:32] 10Operations, 10netops: Rancid improvements - https://phabricator.wikimedia.org/T167288#3323402 (10ayounsi) [12:59:50] (03PS2) 10Ema: base::kernel: create /etc/modules-load.d on Trusty systems [puppet] - 10https://gerrit.wikimedia.org/r/357591 [13:00:00] (03CR) 10Ema: [V: 032 C: 032] base::kernel: create /etc/modules-load.d on Trusty systems [puppet] - 10https://gerrit.wikimedia.org/r/357591 (owner: 10Ema) [13:00:04] addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Dear anthropoid, the time has come. Please deploy European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170607T1300). [13:00:05] dcausse and jan_drewniak: A patch you scheduled for European Mid-day SWAT(Max 8 patches) is about to be deployed. Please be available during the process. 
[13:01:54] Hey dcausse (for some reason I thought with swat was at 4pm) but can you handle a portal deploy? [13:02:04] o/ [13:02:17] jan_drewniak: never done that but I can try [13:03:03] 10Operations, 10netops: Rancid improvements - https://phabricator.wikimedia.org/T167288#3323427 (10faidon) Why not convert? I think there's a lot of value in doing so. Agreed on the rest. Moreover, it would be nice if we could filter the output and remove some of the known artifacts (cr2-ulsfo's TFEB -/+, LLD... [13:03:20] It's just cd'ing into the wikimedia-config repo and running the sync-portals script at the root of the repo [13:04:13] jan_drewniak: ok [13:04:38] dcausse but I might not be available to check, because I'm still waiting at the doctors office... so maybe this should be done later [13:04:55] jan_drewniak: as you want [13:05:18] maybe you can reschedule for sf morning? [13:05:20] (03PS1) 10Jcrespo: mariadb-query-killer: Reduce threshold for overload detection [software] - 10https://gerrit.wikimedia.org/r/357594 [13:05:29] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [13:05:29] PROBLEM - MariaDB Slave IO: x1 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:05:30] PROBLEM - MariaDB Slave SQL: m3 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:05:30] PROBLEM - MariaDB Slave SQL: x1 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:05:30] PROBLEM - MariaDB Slave SQL: m2 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:05:30] PROBLEM - MariaDB Slave IO: s3 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:05:30] PROBLEM - MariaDB Slave IO: s4 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:05:31] PROBLEM - MariaDB Slave SQL: s1 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[13:05:31] PROBLEM - MariaDB Slave IO: s6 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:05:32] PROBLEM - MariaDB Slave SQL: s6 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:05:32] PROBLEM - MariaDB Slave SQL: s5 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:05:33] PROBLEM - MariaDB Slave IO: m2 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:05:33] PROBLEM - MariaDB Slave IO: s7 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:05:34] PROBLEM - MariaDB Slave SQL: s7 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:05:42] I will silence that, those are the backups [13:06:17] <_joe_> marostegui: isn't there a way to know backups are running? [13:06:39] RECOVERY - MariaDB Slave IO: s2 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [13:06:39] RECOVERY - MariaDB Slave SQL: s2 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes [13:06:39] RECOVERY - MariaDB Slave SQL: s4 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional) [13:06:39] RECOVERY - MariaDB Slave IO: m3 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [13:06:39] RECOVERY - MariaDB Slave IO: s1 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [13:06:48] Other than looking at tendril not really :( [13:06:54] (Or ssh' ing to the host) [13:07:01] _joe_: the backups fixing is happening next quarter [13:07:09] Dcausse: Unfortunately I don't know how long I'll be waiting... 
yeah I'll reschedule for SF morning [13:07:13] <_joe_> okok I'll shut up :P [13:07:24] _joe_: you can see it at https://tendril.wikimedia.org/activity?root=0&wikiuser=0&research=0&labsusers=0 [13:07:49] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [13:07:53] <_joe_> uhm [13:07:56] <_joe_> ema ^^ [13:08:02] _joe_: I wasn't recriminating you, just explaining we know it is broken, and we plan to fix it [13:08:07] <_joe_> maybe our scraping friend is back [13:08:09] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [13:08:17] <_joe_> jynus: yeah I was joking :) [13:08:19] RECOVERY - MariaDB Slave IO: x1 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [13:08:20] RECOVERY - MariaDB Slave IO: m2 on dbstore1001 is OK: OK slave_io_state not a slave [13:08:20] RECOVERY - MariaDB Slave SQL: m2 on dbstore1001 is OK: OK slave_sql_state not a slave [13:08:20] RECOVERY - MariaDB Slave SQL: s3 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes [13:08:20] RECOVERY - MariaDB Slave SQL: s7 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional) [13:08:20] RECOVERY - MariaDB Slave IO: s3 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [13:08:20] RECOVERY - MariaDB Slave IO: s5 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [13:08:21] RECOVERY - MariaDB Slave SQL: s1 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional) [13:08:21] RECOVERY - MariaDB Slave SQL: m3 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes [13:08:22] RECOVERY - MariaDB Slave SQL: x1 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional) [13:08:22] RECOVERY - MariaDB Slave SQL: s6 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional) [13:08:23] RECOVERY -
MariaDB Slave IO: s4 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [13:08:23] RECOVERY - MariaDB Slave IO: s7 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [13:08:24] RECOVERY - MariaDB Slave SQL: s5 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes [13:09:26] there are spikes of errors in the last hour [13:09:29] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [13:09:41] _joe_: yup I was just looking at the logs on oxygen [13:09:53] I was right now putting a harder limit to the query killer [13:11:00] no specific user-agent stands out this time [13:13:40] <_joe_> ema: it's wikiscrape again [13:13:55] _joe_, jynus, ema: I'm going to SWAT a patch (wmf-config), no objections? [13:14:10] <_joe_> dcausse: wait a sec, we're in the middle of problems [13:14:14] dcausse: hold on a sec, there are some issues [13:14:16] ok [13:14:42] 10Operations, 10Traffic, 10netops, 10Patch-For-Review: Re-setup lvs1007-lvs1012, replace lvs1001-lvs1006 - https://phabricator.wikimedia.org/T150256#3323439 (10ops-monitoring-bot) Script wmf_auto_reimage was launched by bblack on neodymium.eqiad.wmnet for hosts: ``` ['lvs1007.eqiad.wmnet'] ``` The log can... 
[13:15:39] (03PS5) 10Alexandros Kosiaris: Refactor facts exporting to better cleanup facts [puppet] - 10https://gerrit.wikimedia.org/r/356814 [13:15:47] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Refactor facts exporting to better cleanup facts [puppet] - 10https://gerrit.wikimedia.org/r/356814 (owner: 10Alexandros Kosiaris) [13:20:04] (03PS2) 10Ema: VCL: basic support to return HTTP 429, apply it to UA:wikiScrape/0.0.0 [puppet] - 10https://gerrit.wikimedia.org/r/357557 [13:21:45] (03PS3) 10Ema: VCL: basic support to return HTTP 429, apply it to UA:wikiScrape/0.0.0 [puppet] - 10https://gerrit.wikimedia.org/r/357557 [13:21:54] (03CR) 10Ema: [V: 032 C: 032] VCL: basic support to return HTTP 429, apply it to UA:wikiScrape/0.0.0 [puppet] - 10https://gerrit.wikimedia.org/r/357557 (owner: 10Ema) [13:27:07] 10Operations, 10ops-eqiad, 10Analytics-Cluster, 10Analytics-Kanban, 10User-Elukey: Analytics hosts showed high temperature alarms - https://phabricator.wikimedia.org/T132256#3323492 (10elukey) @Cmjohnson do you have time during the next days to do a couple of hosts? 
[13:28:02] dcausse: jan_drewniak: I am more or less around [13:28:46] (03PS2) 10Faidon Liambotis: labstore: remove TC=$(which tc) [puppet] - 10https://gerrit.wikimedia.org/r/356107 [13:28:48] (03PS2) 10Faidon Liambotis: labstore: use the interface_primary fact, not eth0 [puppet] - 10https://gerrit.wikimedia.org/r/356108 [13:28:50] (03PS3) 10Faidon Liambotis: labstore: avoid the hardcoding of eth0/eth1 [puppet] - 10https://gerrit.wikimedia.org/r/356109 [13:28:52] (03PS1) 10Faidon Liambotis: labstore: use /sbin/tc, not $PATH/tc [puppet] - 10https://gerrit.wikimedia.org/r/357597 [13:29:12] hashar: there's an issue atm, if it's fixed before 4pm CET I'll swat my change, jan re-scheduled his patch for sf morning [13:29:15] (03PS3) 10Faidon Liambotis: labstore: remove TC=$(which tc) [puppet] - 10https://gerrit.wikimedia.org/r/356107 [13:29:17] (03PS3) 10Faidon Liambotis: labstore: use the interface_primary fact, not eth0 [puppet] - 10https://gerrit.wikimedia.org/r/356108 [13:29:19] (03PS4) 10Faidon Liambotis: labstore: avoid the hardcoding of eth0/eth1 [puppet] - 10https://gerrit.wikimedia.org/r/356109 [13:29:21] (03PS2) 10Faidon Liambotis: labstore: use /sbin/tc, not $PATH/tc [puppet] - 10https://gerrit.wikimedia.org/r/357597 [13:30:31] dcausse: sounds good. Thank you! 
[13:30:47] listening to talk about Pachyderm, a replacement for Hadoop/HDFS :} [13:31:32] I like the name :) [13:31:49] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [13:32:12] (03CR) 10Faidon Liambotis: "So this was doing $(which tc) before, so this change by itself is a non-functional change so I kept as it was and pushed a separate change" [puppet] - 10https://gerrit.wikimedia.org/r/356107 (owner: 10Faidon Liambotis) [13:35:49] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] [13:35:53] (03PS1) 10Alexandros Kosiaris: Fix 2 issues in compiler-update-facts [puppet] - 10https://gerrit.wikimedia.org/r/357598 [13:36:59] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [13:37:02] (03CR) 10Alexandros Kosiaris: [C: 032] Fix 2 issues in compiler-update-facts [puppet] - 10https://gerrit.wikimedia.org/r/357598 (owner: 10Alexandros Kosiaris) [13:37:43] 10Operations, 10Continuous-Integration-Infrastructure: CI for operations/puppet is taking too long - https://phabricator.wikimedia.org/T166888#3323538 (10MoritzMuehlenhoff) [13:37:45] 10Operations, 10Continuous-Integration-Infrastructure: Collate jessie-wikimedia/backports into jessie-wikimedia/main - https://phabricator.wikimedia.org/T167292#3323526 (10MoritzMuehlenhoff) [13:37:58] 10Operations: Collate jessie-wikimedia/backports into jessie-wikimedia/main - https://phabricator.wikimedia.org/T167292#3323526 (10MoritzMuehlenhoff) a:03MoritzMuehlenhoff [13:38:07] moritzm: wrong task? [13:38:31] meh, wrong tab. fixing [13:38:38] :) [13:38:39] dcausse: basically a replacement for hadoop based on k8s / a filesystem similar to git (a graph with commits) and docker to run analysis (buzz word bingo!!!) 
[13:38:59] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [13:39:31] 10Operations, 10Continuous-Integration-Infrastructure: CI for operations/puppet is taking too long - https://phabricator.wikimedia.org/T166888#3310890 (10MoritzMuehlenhoff) [13:39:33] 10Operations: Collate jessie-wikimedia/backports into jessie-wikimedia/main - https://phabricator.wikimedia.org/T167292#3323526 (10MoritzMuehlenhoff) [13:41:54] (03PS1) 10Aude: Don't enable Wikibase data access yet for beta wiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/357601 (https://phabricator.wikimedia.org/T158324) [13:42:29] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [13:42:59] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [13:43:12] dcausse: still doing swat? [13:43:20] dcausse: you can proceed [13:43:28] (03PS5) 10Giuseppe Lavagetto: Add netlink-based Ipvsmanager implementation [debs/pybal] (2.0-dev) - 10https://gerrit.wikimedia.org/r/355082 [13:43:31] aude: yes [13:43:34] jynus: thanks [13:43:34] ok [13:43:49] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [13:44:02] when you are done, i might want to add something [13:44:02] (03CR) 10DCausse: [C: 032] [cirrus] Enable crossproject search on all wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/357413 (https://phabricator.wikimedia.org/T162276) (owner: 10DCausse) [13:44:09] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [13:44:11] 10Operations, 10Labs, 10cloud-services-team (Kanban): Initial OpenStack Neutron PoC deployment in Labtest - https://phabricator.wikimedia.org/T153099#3323580 (10chasemp) [13:44:12] aude: sure [13:44:15] or might be able to do it later swat [13:45:06] <_joe_> dcausse: we're out of the 
woods, you can go on [13:45:24] _joe_: thanks, swating [13:48:27] (03PS2) 10Andrew Bogott: designate.conf: Update the keystone_authtoken section [puppet] - 10https://gerrit.wikimedia.org/r/357512 [13:51:39] (03PS4) 10DCausse: [cirrus] Enable crossproject search on all wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/357413 (https://phabricator.wikimedia.org/T162276) [13:53:58] 10Operations, 10Traffic, 10netops, 10Patch-For-Review: Re-setup lvs1007-lvs1012, replace lvs1001-lvs1006 - https://phabricator.wikimedia.org/T150256#3323647 (10ops-monitoring-bot) Script wmf_auto_reimage was launched by bblack on neodymium.eqiad.wmnet for hosts: ``` ['lvs1007.eqiad.wmnet'] ``` The log can... [13:54:09] (03CR) 10Andrew Bogott: [C: 032] designate.conf: Update the keystone_authtoken section [puppet] - 10https://gerrit.wikimedia.org/r/357512 (owner: 10Andrew Bogott) [13:55:49] (03CR) 10jenkins-bot: [cirrus] Enable crossproject search on all wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/357413 (https://phabricator.wikimedia.org/T162276) (owner: 10DCausse) [13:57:55] (03PS1) 10Faidon Liambotis: autoinstall: avoid hardcoding eth0 [puppet] - 10https://gerrit.wikimedia.org/r/357608 (https://phabricator.wikimedia.org/T164444) [13:57:57] (03PS1) 10Faidon Liambotis: Switch stretch to predictable network interface names [puppet] - 10https://gerrit.wikimedia.org/r/357609 (https://phabricator.wikimedia.org/T158429) [13:58:11] godog, moritzm ^ [14:00:02] (03PS1) 10Filippo Giunchedi: grafana: unhardcode eth0 from server-board [puppet] - 10https://gerrit.wikimedia.org/r/357611 (https://phabricator.wikimedia.org/T164444) [14:00:23] (03CR) 10Faidon Liambotis: "This hasn't been tested with a new install but was tested locally with:" [puppet] - 10https://gerrit.wikimedia.org/r/357608 (https://phabricator.wikimedia.org/T164444) (owner: 10Faidon Liambotis) [14:01:31] (03PS1) 10DCausse: Revert "[cirrus] Enable crossproject search on all wikipedias" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/357612 [14:01:51] (03PS2) 10Faidon Liambotis: grafana: unhardcode eth0 from server-board [puppet] - 10https://gerrit.wikimedia.org/r/357611 (owner: 10Filippo Giunchedi) [14:02:11] aude: I'm going to revert my patch, it does not work as expected, I cleanup tin and mwdebug1002 and you can proceed [14:03:11] 10Operations, 10Patch-For-Review: Puppet: test non stringified facts across the fleet - https://phabricator.wikimedia.org/T166372#3323700 (10faidon) 05Open>03Resolved Structured facts were enabled across both production and Labs realms and seem to have been working fine :) We even converted a few facts of... [14:03:16] (03CR) 10Faidon Liambotis: [V: 032 C: 032] grafana: unhardcode eth0 from server-board [puppet] - 10https://gerrit.wikimedia.org/r/357611 (owner: 10Filippo Giunchedi) [14:03:23] dcausse: [14:03:24] ok [14:03:38] i can reschedule mine for later swat [14:03:51] (it's not urgent at all) [14:04:13] (03CR) 10Alexandros Kosiaris: [C: 032] nagios_common: basic spec for contacts.cfg [puppet] - 10https://gerrit.wikimedia.org/r/331490 (owner: 10Hashar) [14:04:15] (03PS4) 10Alexandros Kosiaris: nagios_common: basic spec for contacts.cfg [puppet] - 10https://gerrit.wikimedia.org/r/331490 (owner: 10Hashar) [14:05:36] 10Operations, 10Traffic, 10netops, 10Patch-For-Review: Re-setup lvs1007-lvs1012, replace lvs1001-lvs1006 - https://phabricator.wikimedia.org/T150256#3323729 (10BBlack) [14:05:38] 10Operations, 10Traffic, 10netops: Upgrade BIOS/RBSU/etc on lvs1007 - https://phabricator.wikimedia.org/T167299#3323716 (10BBlack) [14:05:53] 10Operations, 10ops-eqiad, 10Traffic, 10netops: Upgrade BIOS/RBSU/etc on lvs1007 - https://phabricator.wikimedia.org/T167299#3323731 (10BBlack) [14:06:36] (03CR) 10DCausse: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/357612 (owner: 10DCausse) [14:07:16] (03CR) 10Filippo Giunchedi: [C: 031] "LGTM!" 
[puppet] - 10https://gerrit.wikimedia.org/r/357608 (https://phabricator.wikimedia.org/T164444) (owner: 10Faidon Liambotis) [14:07:30] (03CR) 10Faidon Liambotis: [C: 032] autoinstall: avoid hardcoding eth0 [puppet] - 10https://gerrit.wikimedia.org/r/357608 (https://phabricator.wikimedia.org/T164444) (owner: 10Faidon Liambotis) [14:07:35] (03PS2) 10Faidon Liambotis: autoinstall: avoid hardcoding eth0 [puppet] - 10https://gerrit.wikimedia.org/r/357608 (https://phabricator.wikimedia.org/T164444) [14:07:46] (03CR) 10Faidon Liambotis: [V: 032 C: 032] autoinstall: avoid hardcoding eth0 [puppet] - 10https://gerrit.wikimedia.org/r/357608 (https://phabricator.wikimedia.org/T164444) (owner: 10Faidon Liambotis) [14:08:07] (03CR) 10Filippo Giunchedi: [C: 031] Switch stretch to predictable network interface names [puppet] - 10https://gerrit.wikimedia.org/r/357609 (https://phabricator.wikimedia.org/T158429) (owner: 10Faidon Liambotis) [14:08:11] 10Operations: Puppet facts around the primary network interface and IPv4/IPv6 address - https://phabricator.wikimedia.org/T163196#3323741 (10faidon) [14:08:13] 10Operations, 10Patch-For-Review: Switch to predictable network interface names? - https://phabricator.wikimedia.org/T158429#3323742 (10faidon) [14:08:15] 10Operations, 10Patch-For-Review: Installer assumes eth0 is the used interface - https://phabricator.wikimedia.org/T164444#3323738 (10faidon) 05Open>03Resolved a:03faidon Fixed, as above. [14:09:23] 10Operations, 10Patch-For-Review: Switch to predictable network interface names? 
- https://phabricator.wikimedia.org/T158429#3036728 (10faidon) Last call for objections, there is a change staged and its merge is imminent :) [14:09:48] (03Merged) 10jenkins-bot: Revert "[cirrus] Enable crossproject search on all wikipedias" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/357612 (owner: 10DCausse) [14:09:57] (03CR) 10jenkins-bot: Revert "[cirrus] Enable crossproject search on all wikipedias" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/357612 (owner: 10DCausse) [14:11:37] !log eu swat done [14:11:42] 10Operations, 10Traffic: Implement Varnish-level rough ratelimiting - https://phabricator.wikimedia.org/T163233#3190763 (10ema) We should evaluate [[ https://github.com/varnish/varnish-modules/blob/master/docs/vmod_vsthrottle.rst | vmod_vsthrottle ]], available in the `varnish-modules` package and see if it's... [14:11:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:12:29] (03PS2) 10Muehlenhoff: Use new repository layout for stretch onwards [puppet] - 10https://gerrit.wikimedia.org/r/357559 (https://phabricator.wikimedia.org/T158583) [14:13:29] (03CR) 10jerkins-bot: [V: 04-1] Use new repository layout for stretch onwards [puppet] - 10https://gerrit.wikimedia.org/r/357559 (https://phabricator.wikimedia.org/T158583) (owner: 10Muehlenhoff) [14:15:41] (03PS1) 10Muehlenhoff: Remove stretch-wikimedia/backports [puppet] - 10https://gerrit.wikimedia.org/r/357616 (https://phabricator.wikimedia.org/T158583) [14:18:02] (03PS3) 10Muehlenhoff: Use new repository layout for stretch onwards [puppet] - 10https://gerrit.wikimedia.org/r/357559 (https://phabricator.wikimedia.org/T158583) [14:18:37] (03PS1) 10Ema: check_ipmi_temp: load ipmi_devintf on trusty [puppet] - 10https://gerrit.wikimedia.org/r/357617 (https://phabricator.wikimedia.org/T125205) [14:21:05] (03PS1) 10Faidon Liambotis: labs bootstrapvz/vmbuilder: avoid hardcoding eth0 [puppet] - 10https://gerrit.wikimedia.org/r/357618 [14:21:58] (03CR) 10jerkins-bot: 
[V: 04-1] Use new repository layout for stretch onwards [puppet] - 10https://gerrit.wikimedia.org/r/357559 (https://phabricator.wikimedia.org/T158583) (owner: 10Muehlenhoff) [14:23:58] 10Operations, 10Monitoring, 10Patch-For-Review: internal IPMI error - https://phabricator.wikimedia.org/T167121#3323795 (10jcrespo) db2042 is jessie, BTW. [14:25:31] elukey, _joe_: did you see https://phabricator.wikimedia.org/T167222#3321312 ? [14:26:10] TL;DR i think something in that puppet refactor is not quite right, and the two deployment-prep clusters are crossed with one another [14:26:18] (03PS4) 10Muehlenhoff: Use new repository layout for stretch onwards [puppet] - 10https://gerrit.wikimedia.org/r/357559 (https://phabricator.wikimedia.org/T158583) [14:26:49] urandom: hello! let me check [14:27:19] elukey: basically, the aqs nodes are listing the restbase nodes as seeds [14:27:45] elukey: normally we'd set the cluster name to some non-default value as a guard against this, but haven't in deployment-prep [14:27:46] this is probably due to a missing hiera config in deployment-prep [14:28:27] (03CR) 10Herron: [C: 031] check_ipmi_temp: load ipmi_devintf on trusty [puppet] - 10https://gerrit.wikimedia.org/r/357617 (https://phabricator.wikimedia.org/T125205) (owner: 10Ema) [14:28:30] i guess, the aqs env is inheriting profile:cassandra in hieradata/labs/deployment-prep/common.yaml? [14:30:05] urandom: definitely [14:30:29] 10Operations, 10Monitoring, 10Patch-For-Review: internal IPMI error - https://phabricator.wikimedia.org/T167121#3323824 (10ema) >>! In T167121#3323795, @jcrespo wrote: > db2042 is jessie, BTW. But it has an old BIOS version. Uh, nice catch. Indeed only some of the Jessie hosts have `ipmi_devintf` loaded (al... [14:30:30] (03PS1) 10Alexandros Kosiaris: puppet-compiler: Do the rsync using sudo [puppet] - 10https://gerrit.wikimedia.org/r/357619 [14:30:30] I have a meeting now but I'll try to fix it in a bit, is it ok? 
[14:30:33] urandom: --^ [14:30:38] elukey: ya; thanks! [14:30:54] (03CR) 10Faidon Liambotis: [C: 031] "LGTM, although I do wonder what loads it in jessie :)" [puppet] - 10https://gerrit.wikimedia.org/r/357617 (https://phabricator.wikimedia.org/T125205) (owner: 10Ema) [14:32:08] (03CR) 10Muehlenhoff: "kmod loads them and systemd ships the /etc/modules-load.d directory." [puppet] - 10https://gerrit.wikimedia.org/r/357617 (https://phabricator.wikimedia.org/T125205) (owner: 10Ema) [14:32:30] (03CR) 10Muehlenhoff: "Disregard my comment...." [puppet] - 10https://gerrit.wikimedia.org/r/357617 (https://phabricator.wikimedia.org/T125205) (owner: 10Ema) [14:34:43] (03CR) 10Faidon Liambotis: [C: 031] Remove stretch-wikimedia/backports [puppet] - 10https://gerrit.wikimedia.org/r/357616 (https://phabricator.wikimedia.org/T158583) (owner: 10Muehlenhoff) [14:35:25] moritzm: re: https://gerrit.wikimedia.org/r/#/c/357559 - I think this won't work [14:35:41] as the repository won't precede the require_package('megacli') I think [14:37:49] PROBLEM - Host lvs1007 is DOWN: PING CRITICAL - Packet loss = 100% [14:38:10] eh [14:39:14] bblack: is it you? ^^^ [14:39:23] or is it icinga? [14:39:35] that is the question :-P [14:39:40] (03PS2) 10Ema: check_ipmi_temp: load ipmi_devintf [puppet] - 10https://gerrit.wikimedia.org/r/357617 (https://phabricator.wikimedia.org/T125205) [14:40:48] (03CR) 10Ema: "> LGTM, although I do wonder what loads it in jessie :)" [puppet] - 10https://gerrit.wikimedia.org/r/357617 (https://phabricator.wikimedia.org/T125205) (owner: 10Ema) [14:40:52] 10Operations, 10Analytics, 10Analytics-Cluster, 10Traffic, 10User-Elukey: Encrypt Kafka traffic, and restrict access via ACLs - https://phabricator.wikimedia.org/T121561#3323871 (10Ottomata) We should do some work to understand how ACLs work and what ACLs for what topics we should set in production. 
[14:41:33] ema: I think some ipmi userspace tool does this [14:42:53] volans: yeah it's me, downtime expired [14:43:32] paravoid: good point. we could also add the hwraid repo in the raid class and have the megacli/arcconf etc. dependencies on the presence of the repository [14:43:59] paravoid: that might very well be :) [14:44:17] moritzm: yeah I thought about that too [14:44:33] moritzm: dunno, I'm a little ambivalent about it :) [14:44:54] moritzm: the rest are fairly generic, while the repo is a bit more wikimedia distro-specific [14:45:05] <_joe_> I thought apt::conf was tied to stage first? [14:45:09] <_joe_> let me check again [14:45:16] apt::repository you mean? [14:45:18] I think no [14:45:29] oh also, modules/labs_bootstrapvz/files/labs-*.manifest.yaml list megacli for some weird reason [14:45:53] <_joe_> no it's not [14:46:26] <_joe_> yeah, probably tying the apt class to stage first would be enough [14:46:42] <_joe_> I mean in general [14:48:59] !log filippo@puppetmaster1001 conftool action : set/pooled=no; selector: name=ms-fe1007.codfw.wmnet [14:49:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:49:38] paravoid: but even if someone were to reuse our raid class on a plain Debian system, they'd need our repo to get the proprietary debs? so it's still mostly generic, doesn't really matter whether anyone fetches them from le-vert or our mirror [14:49:43] weird conftool accepted that invalid hostname [14:50:37] 10Operations, 10Traffic: Implement Varnish-level rough ratelimiting - https://phabricator.wikimedia.org/T163233#3323972 (10BBlack) re: vsthrottle, my thoughts after a quick look this morning: 1. The burst issue seems fine. It initializes fresh buckets with full capacity. 2. Memory leaks - it seems to have pr... 
[14:50:40] godog: i wonder if conftool checks to see if the host exists [14:50:59] _joe_: we've tried that before, it was a mess [14:51:07] !log filippo@puppetmaster1001 conftool action : set/pooled=no; selector: name=ms-fe2007.codfw.wmnet [14:51:13] (03CR) 10Andrew Bogott: [C: 032] labs bootstrapvz/vmbuilder: avoid hardcoding eth0 [puppet] - 10https://gerrit.wikimedia.org/r/357618 (owner: 10Faidon Liambotis) [14:51:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:51:27] <_joe_> paravoid: damn puppet [14:51:53] a14a2e38ea67668215715f05e5e2ac73cd76f035 [14:51:57] 7909d4c07cc3d5d69ed5bbf692d05c7671ea999a [14:52:09] basically [14:52:28] if you have class A { apt::repository { 'foo': } } [14:52:42] then class B requiring class A or a resource from class A [14:52:56] and then if you put apt::repository on a first Stage [14:53:01] 10Operations, 10netops: ospf link-protection - https://phabricator.wikimedia.org/T167306#3323984 (10ayounsi) [14:53:03] you've created a dependency loop [14:53:49] class['B'] -> class ['A'] -> Stage['first'] -> class ['B'] [14:54:09] er, sorry [14:54:17] something along these lines anyway [14:56:07] (03PS2) 10Jcrespo: mariadb-query-killer: Reduce threshold for overload detection [software] - 10https://gerrit.wikimedia.org/r/357594 [14:56:11] (03PS1) 10Jcrespo: mariadb: Improve systemd and package management [software] - 10https://gerrit.wikimedia.org/r/357626 (https://phabricator.wikimedia.org/T116903) [14:56:39] (03CR) 10Jcrespo: [C: 032] mariadb-query-killer: Reduce threshold for overload detection [software] - 10https://gerrit.wikimedia.org/r/357594 (owner: 10Jcrespo) [14:57:09] (03PS2) 10Jcrespo: mariadb: Improve systemd and package management [software] - 10https://gerrit.wikimedia.org/r/357626 (https://phabricator.wikimedia.org/T116903) [14:58:28] jynus: these control files are really, really wrong :) [14:59:56] 10Blocked-on-Operations, 10Puppet, 10Reading-Infrastructure-Team-Backlog, 10Sentry, and 3 
others: Create basic puppet role for Sentry - https://phabricator.wikimedia.org/T84956#3324125 (10Fjalapeno) [15:00:53] something is messing up my files [15:01:31] 10Operations, 10Ops-Access-Requests, 10Patch-For-Review: Grant sudo access for Bryan Davis for labstore* and labsdb* - https://phabricator.wikimedia.org/T166310#3324137 (10chasemp) If we want to holdoff on the labsdb root inclusion I am going to propose in the opsen meeting this task become: * root on labst... [15:03:29] base is such a mess [15:04:54] 10Operations, 10media-storage, 10Patch-For-Review, 10User-fgiunchedi: Swift version and distro upgrade - https://phabricator.wikimedia.org/T162609#3324142 (10fgiunchedi) I've upgraded all ms-fe2* to swift 2.10, the trusty -> stretch conversion of ms-be2* is ongoing. Regardless of the latter I think we coul... [15:05:00] 10Operations, 10Monitoring, 10Services (next), 10User-Joe, 10User-mobrovac: Services need external monitoring - https://phabricator.wikimedia.org/T167048#3324143 (10Joe) In order to do that, I want to do a local nrpe check on the cache edge servers, calling the SSL terminator, so that we cover as many lo... [15:05:32] (03PS1) 10Faidon Liambotis: labs_bootstrapvz: don't install HW RAID tools [puppet] - 10https://gerrit.wikimedia.org/r/357627 [15:08:31] 10Operations, 10Monitoring, 10Services (next), 10User-Joe, 10User-mobrovac: Services need external monitoring - https://phabricator.wikimedia.org/T167048#3315673 (10faidon) Why not from the Icinga host itself like we do with all high-level LVS checks? 
[15:09:03] (03CR) 10Andrew Bogott: [C: 032] labs_bootstrapvz: don't install HW RAID tools [puppet] - 10https://gerrit.wikimedia.org/r/357627 (owner: 10Faidon Liambotis) [15:09:08] (03PS2) 10Andrew Bogott: labs_bootstrapvz: don't install HW RAID tools [puppet] - 10https://gerrit.wikimedia.org/r/357627 (owner: 10Faidon Liambotis) [15:10:09] 10Operations, 10Labs, 10Patch-For-Review: Upgrade facter to version 2.4.6 - https://phabricator.wikimedia.org/T166203#3324156 (10Volans) 05Open>03Resolved a:03Volans Facter is upgraded in production on the whole fleet apart `cp3003.esams.wmnet,labstore[1001-1002].eqiad.wmnet` that will need to be reima... [15:15:53] (03CR) 10Muehlenhoff: [C: 031] "+1 to doing this with the stretch release, since this way we'll spot potential role-specific problems piece by piece as we migrate to stre" [puppet] - 10https://gerrit.wikimedia.org/r/357609 (https://phabricator.wikimedia.org/T158429) (owner: 10Faidon Liambotis) [15:17:32] madhuvishy: pushed https://gerrit.wikimedia.org/r/#/c/357597/ btw [15:21:13] (03CR) 10Filippo Giunchedi: [C: 031] check_ipmi_temp: load ipmi_devintf [puppet] - 10https://gerrit.wikimedia.org/r/357617 (https://phabricator.wikimedia.org/T125205) (owner: 10Ema) [15:31:07] (03PS1) 10Anomie: Deploy TemplateStyles to Beta Labs for testing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/357629 (https://phabricator.wikimedia.org/T133414) [15:32:12] 10Operations, 10Traffic: Implement Varnish-level rough ratelimiting - https://phabricator.wikimedia.org/T163233#3324322 (10BBlack) Stared at hashtable implementation some more, as well as the linux iptables hashlimit one (which I consider a sort of baseline canonical efficient implementation). The linux one i... [15:33:27] (03PS1) 10Giuseppe Lavagetto: Add assert_hostname to poolmanager instantiation. 
[software/service-checker] - 10https://gerrit.wikimedia.org/r/357631 [15:33:34] <_joe_> mobrovac: ^^ [15:34:49] wowo that was fast _joe_ [15:36:25] (03CR) 10Giuseppe Lavagetto: [C: 032] Add assert_hostname to poolmanager instantiation. [software/service-checker] - 10https://gerrit.wikimedia.org/r/357631 (owner: 10Giuseppe Lavagetto) [15:37:13] _joe_: euh why is the assignment in a try/except block? can http_host not be there? [15:37:45] (03PS3) 10BBlack: maps->upload: delete maps-specific things [puppet] - 10https://gerrit.wikimedia.org/r/352834 (https://phabricator.wikimedia.org/T164608) [15:37:46] <_joe_> mobrovac: for some derived checkers, yes [15:37:46] !log disable puppet on puppetmaster1001, depool rhodium for tests [15:37:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:38:04] <_joe_> mobrovac: say the logstash one, we're querying logstash, we don't need to set the http host [15:40:23] (03PS4) 10BBlack: maps->upload: delete maps-specific things [puppet] - 10https://gerrit.wikimedia.org/r/352834 (https://phabricator.wikimedia.org/T164608) [15:41:36] urandom: I think that we'd need to figure out what are the things that we need to keep separate between aqs and restbase for profile::cassandra::* and add a host level override in Wikitech/Horizon (I am reading https://wikitech.wikimedia.org/wiki/Puppet_Hiera#In_Labs and I don't see a better solution but I might be wront) [15:41:41] (wrong) [15:42:45] (03PS1) 10BBlack: remove maps geoip definition [dns] - 10https://gerrit.wikimedia.org/r/357632 (https://phabricator.wikimedia.org/T164608) [15:42:55] (03CR) 10Gergő Tisza: [C: 032] Deploy TemplateStyles to Beta Labs for testing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/357629 (https://phabricator.wikimedia.org/T133414) (owner: 10Anomie) [15:43:29] elukey: host level? [15:44:11] elukey: how does roles factor into this? 
[15:44:13] urandom: yep, hiera settings for each hosts that are evaluated before the ones that you pointed out [15:44:21] (03Merged) 10jenkins-bot: Deploy TemplateStyles to Beta Labs for testing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/357629 (https://phabricator.wikimedia.org/T133414) (owner: 10Anomie) [15:44:31] jynus: we discussed it before and I forgot, apologies for asking again [15:44:46] jynus: under which circumstances is this write cache raid policy hiera variable useful? [15:45:23] 10Operations, 10Performance-Team, 10Reading-Infrastructure-Team-Backlog, 10Reading-Infrastructure-Team-Old (Don't use), and 5 others: Deploy TemplateStyles to WMF production - https://phabricator.wikimedia.org/T133410#3324396 (10Jdforrester-WMF) [15:45:27] urandom: basically as you said profile::cassandra::* values (the ones that you pointed out before) are now read by both AQS and Restbase since they use the same profile [15:45:45] urandom: in prod we have the role hiera lookups that keeps the two separates [15:45:47] (03CR) 10jenkins-bot: Deploy TemplateStyles to Beta Labs for testing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/357629 (https://phabricator.wikimedia.org/T133414) (owner: 10Anomie) [15:46:09] elukey: is that structure reflect in labs too somehow? i don't see it. [15:46:41] urandom: as far as I can see from https://wikitech.wikimedia.org/wiki/Puppet_Hiera#In_Labs no, it is much simpler [15:47:06] urandom: this is why I was suggesting to add spefic AQS overrides for each aqs instance [15:47:20] ok [15:47:26] (03PS3) 10Jcrespo: mariadb: Improve systemd and package management [software] - 10https://gerrit.wikimedia.org/r/357626 (https://phabricator.wikimedia.org/T116903) [15:47:35] there will be some duplication but it is the best I can think of [15:48:08] <_joe_> elukey: just use prefix puppet? 
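Editor's note: the override order being discussed here (per-host settings first, then a per-prefix setting, then the project-wide common.yaml) can be illustrated with a minimal sketch. This is not how Hiera itself is implemented; the key name `profile::cassandra::seeds`, the hostnames other than deployment-aqs02, and all values are hypothetical and chosen only to mirror the AQS/RESTBase situation described above.

```python
from typing import Any, Dict, Optional

# Hypothetical project-wide defaults (what a shared common.yaml would provide).
COMMON: Dict[str, Any] = {
    "profile::cassandra::seeds": ["restbase-a", "restbase-b"],
}

# Hypothetical prefix-level overrides: apply to every host whose name
# starts with the prefix, so the override is written only once.
PREFIX: Dict[str, Dict[str, Any]] = {
    "deployment-aqs": {"profile::cassandra::seeds": ["aqs-a", "aqs-b"]},
}

# Hypothetical host-level overrides: most specific, checked first.
HOST: Dict[str, Dict[str, Any]] = {}


def lookup(key: str, hostname: str) -> Optional[Any]:
    """Return the most specific value for key: host, then prefix, then common."""
    if key in HOST.get(hostname, {}):
        return HOST[hostname][key]
    # Try longer prefixes first so the most specific matching prefix wins.
    for prefix in sorted(PREFIX, key=len, reverse=True):
        if hostname.startswith(prefix) and key in PREFIX[prefix]:
            return PREFIX[prefix][key]
    return COMMON.get(key)


if __name__ == "__main__":
    print(lookup("profile::cassandra::seeds", "deployment-aqs02"))
    print(lookup("profile::cassandra::seeds", "deployment-restbase01"))
```

With a prefix-level entry, the AQS hosts stop inheriting the shared seed list while every other host keeps the common value, which is the duplication-free variant suggested in the channel.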
[15:48:18] <_joe_> so you can just do it once [15:48:26] (03PS1) 10BBlack: remove old maps reverse dns [dns] - 10https://gerrit.wikimedia.org/r/357634 (https://phabricator.wikimedia.org/T164608) [15:48:38] (03PS4) 10Jcrespo: mariadb: Improve systemd and package management [software] - 10https://gerrit.wikimedia.org/r/357626 (https://phabricator.wikimedia.org/T116903) [15:48:41] _joe_ I though it was deprecated/not-to-be-used [15:48:46] (03PS1) 10BBlack: lvs_service_ips: remove old maps IPs [puppet] - 10https://gerrit.wikimedia.org/r/357635 (https://phabricator.wikimedia.org/T164608) [15:48:50] (03CR) 10Faidon Liambotis: [C: 04-1] "These control files are really, really wrong. Also, why are they in operations/software like that?" [software] - 10https://gerrit.wikimedia.org/r/357626 (https://phabricator.wikimedia.org/T116903) (owner: 10Jcrespo) [15:49:07] paravoid: can you elaborate? [15:49:27] do you mean the variable itself or what it tries to do? [15:49:40] what it tries to do [15:49:57] or why isn't it sufficient to check for the difference between configuration and runtime in the raid controller [15:50:12] ok, now I understad what you mean [15:50:21] i.e. if the raid config is for writeback and for some reason (e.g. 
BBU failure) it's operating as writethrough [15:50:50] because it both checks for a temporary problem 8hw problem [15:50:57] and a config problem [15:51:21] I thought at first to check with the default policy, but there were many hosts with the wrong default policy [15:51:25] ugh [15:51:27] analytics had that [15:51:37] and it is quite likely to change [15:51:41] on hr reset [15:51:42] I haven't thought about it much, but I'm not sure if check_raid (and friends) is the right place for config checks [15:51:43] bios reset [15:51:45] etc [15:51:59] we need that check [15:52:06] and I will show you why [15:52:11] that's what we're discussing aren't we [15:52:34] perhaps what we need is a separate puppet declaration to *apply* a config, and have that in puppet [15:52:39] not in the check portion, though [15:55:07] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: x1 master db1031: Faulty BBU - https://phabricator.wikimedia.org/T166108#3324426 (10jcrespo) {F8399947} [15:55:12] 10Operations, 10MW-1.30-release-notes, 10Traffic, 10HTTPS, and 2 others: Enable HTTPS for swift clients - https://phabricator.wikimedia.org/T160616#3324427 (10fgiunchedi) [15:55:19] paravoid: this happens every week: https://phabricator.wikimedia.org/T166108#3324426 [15:55:53] which has the same result as today's incident [15:56:02] I understand the problem [15:56:05] I disagree about the solution [15:56:11] oh [15:56:19] Default Cache Policy: WriteBack, ReadAdaptive, Direct, No Write Cache if Bad BBU [15:56:22] Current Cache Policy: WriteThrough, ReadAdaptive, Direct, No Write Cache if Bad BBU [15:56:25] this is something we should alert for [15:56:27] if you can code a throughput check [15:56:28] definitely [15:56:28] default != current [15:56:33] I will retire it [15:56:38] :-) [15:56:50] no [15:56:54] no? [15:56:55] we disagree thre [15:57:21] disagree on : "if default != current -> alert"? 
[15:57:24] for the hosts marked, WriteThorough == WriteTrhough would have the same outage consequences [15:57:31] and it would not alert [15:57:41] and that has happened as frequently [15:57:41] that's a separate tehing [15:57:43] *thing [15:58:05] there is right now analytics hosts with that issue [15:58:06] "default cache policy should be writeback" should probably be a puppet definition, not an alert [15:58:20] it's a config change, and like all config changes can be done by puppet [15:58:23] no, it cannot be changed in some cases [15:58:28] s/done/enforced/ :) [15:58:29] oh? [15:58:30] how come? [15:58:36] hw is broken [15:58:52] hw is broken alters the *current* policy [15:58:53] this is for 5+ year hardware that for some reason is critical still [15:58:59] not the default one [15:59:05] afaik at least? [15:59:06] urandom: afaics we just need to have different profile::cassandra::instances values right? [15:59:25] misconfiguration [16:00:00] RECOVERY - nova instance creation test on labnet1001 is OK: PROCS OK: 1 process with command name python, args nova-fullstack [16:00:00] if we alert on "hw config is different than default" [16:00:07] people will change the default [16:00:11] I'm proposing two different things: [16:00:13] and thta has happened [16:00:17] - alert on "current != default" [16:00:24] - enforce, via puppet, the desired default config [16:00:37] one is a runtime check, the other is a configuration change [16:00:38] that's all [16:00:38] elukey: i think so [16:00:39] I can do the first [16:00:46] if you tell me how to do the second [16:00:49] I'll fix both [16:01:19] urandom: all right another meeting, will try to fix it asap [16:01:26] and what will we do when puppet fails to change it? 
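Editor's note: the two separate conditions argued over above (a runtime check for "current != default", and a configuration check that the default itself is the desired one) can be sketched as follows. This is only an illustration under assumptions: it parses the two "Cache Policy" lines in the form quoted in the channel, the function names are hypothetical, and it is not the existing check_raid code.

```python
from typing import Dict, List


def parse_cache_policies(text: str) -> Dict[str, List[str]]:
    """Extract the 'Default' and 'Current' cache policy lines from controller output."""
    policies: Dict[str, List[str]] = {}
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        name = name.strip()
        if sep and name in ("Default Cache Policy", "Current Cache Policy"):
            # Key is "default" or "current"; value is the comma-separated policy list.
            policies[name.split()[0].lower()] = [p.strip() for p in value.split(",")]
    return policies


def check_cache_policy(text: str, desired_write_policy: str = "WriteBack") -> List[str]:
    """Return a list of problems; an empty list means both checks passed."""
    policies = parse_cache_policies(text)
    default = policies.get("default")
    current = policies.get("current")
    if default is None or current is None:
        return ["could not find both cache policy lines"]
    problems = []
    # Runtime check: e.g. a failed BBU silently degrading WriteBack to WriteThrough.
    if current != default:
        problems.append("current policy differs from default: %s vs %s"
                        % (current[0], default[0]))
    # Configuration check: the default itself is not the desired write policy.
    if default[0] != desired_write_policy:
        problems.append("default write policy is %s, expected %s"
                        % (default[0], desired_write_policy))
    return problems


if __name__ == "__main__":
    sample = (
        "Default Cache Policy: WriteBack, ReadAdaptive, Direct, No Write Cache if Bad BBU\n"
        "Current Cache Policy: WriteThrough, ReadAdaptive, Direct, No Write Cache if Bad BBU\n"
    )
    for problem in check_cache_policy(sample):
        print(problem)
```

Keeping the two conditions separate shows why "default != current" alone is not enough: a host where both lines say WriteThrough passes the first check and is only caught by the second.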
[16:03:00] PROBLEM - nova instance creation test on labnet1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name python, args nova-fullstack [16:07:48] 10Operations, 10ops-codfw, 10Labs, 10Labs-Infrastructure: rack/setup/install labtestneutron2002 - https://phabricator.wikimedia.org/T167160#3324460 (10Papaul) [16:08:23] 10Operations, 10ops-codfw, 10Labs, 10Labs-Infrastructure: rack/setup/install labtestnet2002 - https://phabricator.wikimedia.org/T167159#3324461 (10Papaul) [16:10:28] 10Operations, 10ops-codfw, 10Labs, 10Labs-Infrastructure: rack/setup/install labtestpuppetmaster2001 - https://phabricator.wikimedia.org/T167157#3324462 (10Papaul) [16:12:39] urandom: I should have fixed the hiera config, want to try on restbase hosts? [16:13:26] urandom: I am running puppet on aqs hosts without restarting cassandra, changes looks good [16:13:28] 10Operations, 10Performance-Team, 10Reading-Infrastructure-Team-Backlog, 10Reading-Infrastructure-Team-Old (Don't use), and 5 others: Deploy TemplateStyles to WMF production - https://phabricator.wikimedia.org/T133410#3324466 (10Anomie) [16:13:46] 10Operations, 10Performance-Team, 10Reading-Infrastructure-Team-Backlog, 10Reading-Infrastructure-Team-Old (Don't use), and 5 others: Deploy TemplateStyles to WMF production - https://phabricator.wikimedia.org/T133410#2231032 (10Anomie) [16:14:13] elukey: checking... [16:14:28] 10Operations, 10Performance-Team, 10Reading-Infrastructure-Team-Backlog, 10Reading-Infrastructure-Team-Old (Don't use), and 5 others: Deploy TemplateStyles to WMF production - https://phabricator.wikimedia.org/T133410#2231032 (10Anomie) [16:15:02] (03CR) 10Jcrespo: "> These control files are really, really wrong. 
Also, why are they in" [software] - 10https://gerrit.wikimedia.org/r/357626 (https://phabricator.wikimedia.org/T116903) (owner: 10Jcrespo) [16:15:19] elukey: looks like there were no changes (which is good) [16:15:30] elukey: it was the aqs nodes that had the extra seed entries [16:16:14] urandom: super - what is the best way to proceed now? Re-create the clusters or restarting hosts one at the time? [16:16:30] (still finishing puppet runs) [16:16:34] elukey: good question [16:16:44] elukey: i've never seen this before, so i have no experience [16:16:49] restarting is a good first step [16:17:00] 10Operations, 10Collection, 10OfflineContentGenerator, 10Reading-Community-Engagement, and 2 others: Replace OCG in collection extension with Electron - https://phabricator.wikimedia.org/T150872#3324483 (10Fjalapeno) [16:17:43] (03CR) 10Jcrespo: "https://phabricator.wikimedia.org/rOSOFe76c10bcb92af46f6ad026af9bebd096f87a83b0" [software] - 10https://gerrit.wikimedia.org/r/357626 (https://phabricator.wikimedia.org/T116903) (owner: 10Jcrespo) [16:18:08] elukey: but i'll be surprised if that is enough :) [16:18:35] elukey: we should restart the aqs nodes first, since they have the changed seeds list [16:18:45] urandom: yeah, one at the time, and see what changes [16:18:58] * urandom nods [16:19:21] urandom: really sorry for the mess, didn't think about deployment-prep :( [16:19:31] no worries [16:19:55] i never think about deployment-prep, and am always caught by suprise :/ [16:20:17] 10Operations, 10netops: codfw row D switch upgrade - https://phabricator.wikimedia.org/T167274#3324498 (10Papaul) @ayounsi Yes I can be available [16:23:39] 10Operations, 10Performance-Team, 10Reading-Infrastructure-Team-Backlog, 10Reading-Infrastructure-Team-Old (Don't use), and 5 others: Deploy TemplateStyles to WMF production - https://phabricator.wikimedia.org/T133410#3324520 (10Jdlrobson) [16:23:59] RECOVERY - nova instance creation test on labnet1001 is OK: PROCS OK: 1 process with command 
name python, args nova-fullstack [16:25:23] 10Operations, 10Performance-Team, 10Reading-Infrastructure-Team-Backlog, 10Reading-Infrastructure-Team-Old (Don't use), and 5 others: Deploy TemplateStyles to WMF production - https://phabricator.wikimedia.org/T133410#2231032 (10Jdlrobson) 05duplicate>03Open Sorry phabricator fail. [16:25:32] 10Operations, 10Performance-Team, 10Reading-Infrastructure-Team-Backlog, 10Reading-Infrastructure-Team-Old (Don't use), and 5 others: Deploy TemplateStyles to WMF production - https://phabricator.wikimedia.org/T133410#3324533 (10Jdlrobson) [16:25:53] 10Operations, 10Performance-Team, 10Reading-Infrastructure-Team-Backlog, 10Reading-Infrastructure-Team-Old (Don't use), and 5 others: Deploy TemplateStyles to WMF production - https://phabricator.wikimedia.org/T133410#2231032 (10Jdlrobson) [16:27:23] urandom: deployment-aqs02 cassandra[6066]: Error opening zip file or JAR manifest missing : /srv/deployment/prometheus/jmx_exporter/lib/jmx_prometheus_javaagent-0.8-SNAPSHOT.jar [16:27:35] does it need to get deployed somehow? [16:27:36] ugh [16:28:42] elukey: it's added in the extra args [16:28:44] a setting [16:28:51] ahh [16:29:00] i guess it was (properly) not added before, and now is being added [16:29:39] urandom: so do we need to add it for restbase in depl-prep too? [16:29:46] I mean the setting [16:30:15] well, i think it was added on the restbase hosts [16:30:20] it's been there [16:30:58] ah it is actually under profile::cassandra::settings [16:31:02] in common.yaml [16:31:41] so i guess we need to either deploy the jar to the aqs cluster, remove the setting from the aqs cluster, or remove it everywhere [16:32:19] elukey: oh wait [16:32:36] elukey: i am wrong [16:32:47] that is what happens when the git-fat doesn't hydrate the jar [16:32:56] s/the git-fat/git-fat/ [16:33:04] so it is deployed, just... 
broken [16:33:44] (03PS1) 10DCausse: [cirrus] Enable crossproject search on all wikipedias (take 2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/357641 (https://phabricator.wikimedia.org/T162276) [16:33:46] (03PS1) 10DCausse: [cleanup] remove old interwiki search config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/357642 [16:34:32] urandom: on the aqs hosts I can see /srv/deployment/prometheus/jmx_exporter/lib/jmx_prometheus_javaagent-0.8-20170117.190412-1.jar for example [16:34:49] PROBLEM - puppet last run on mw2152 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 2 minutes ago with 2 failures. Failed resources (up to 3 shown): Package[tshark],Package[scap] [16:35:08] 10Operations, 10MW-1.30-release-notes, 10Traffic, 10HTTPS, and 2 others: Enable HTTPS for swift clients - https://phabricator.wikimedia.org/T160616#3324616 (10aaron) Deploy was 00:18 May 26 UTC, and {F8400067} I can't discern an effect on api upload entry point runtime. [16:35:46] elukey: yeah [16:36:12] so the jar name is wrong... [16:36:39] elukey: it is: /srv/deployment/prometheus/jmx_exporter/lib/jmx_prometheus_javaagent-0.8-SNAPSHOT.jar [16:36:43] yep [16:36:48] elukey: it should be: /srv/deployment/prometheus/jmx_exporter/lib/jmx_prometheus_javaagent-0.8-20170117.190412-1.jar [16:36:59] i... don't even [16:37:18] so this means that if you restart restbase it will fail too no? 
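Editor's note: the mismatch found above (the configured javaagent path ends in -SNAPSHOT.jar while the file on disk carries a timestamped name) is the kind of thing a small pre-restart check could catch. The sketch below is hypothetical and not part of any deployed tooling; the function name and the "-SNAPSHOT" naming assumption are the editor's.

```python
import glob
import os
from typing import List, Tuple


def check_agent_jar(configured_path: str) -> Tuple[bool, List[str]]:
    """Return (exists, candidates).

    If the configured jar is missing, candidates lists jars in the same
    directory that share the name prefix (assumes a '-SNAPSHOT' suffix
    may have been replaced by a timestamped build name).
    """
    if os.path.isfile(configured_path):
        return True, []
    directory = os.path.dirname(configured_path)
    base = os.path.basename(configured_path)
    stem = base[:-len(".jar")] if base.endswith(".jar") else base
    prefix = stem[:-len("-SNAPSHOT")] if stem.endswith("-SNAPSHOT") else stem
    # Escape the literal parts so only the trailing '*' acts as a wildcard.
    pattern = os.path.join(glob.escape(directory), glob.escape(prefix) + "-*.jar")
    return False, sorted(glob.glob(pattern))


if __name__ == "__main__":
    ok, candidates = check_agent_jar(
        "/srv/deployment/prometheus/jmx_exporter/lib/"
        "jmx_prometheus_javaagent-0.8-SNAPSHOT.jar"
    )
    if not ok:
        print("configured jar is missing; similar files:", candidates)
```

Run before restarting a Cassandra instance, a check like this would report the missing jar and point at the timestamped file instead of letting the JVM fail at startup.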
[16:37:23] yhes [16:37:24] yes [16:37:45] elukey: totally unrelated to all of the rest of thiss [16:37:55] of course, moar fun [16:38:00] ¯\_(ツ)_/¯ [16:38:12] all right let me change it in puppet then [16:38:29] i guess this is another example of deployment-prep being forgotten when changes are made [16:38:37] probably by me here [16:41:06] (03PS1) 10Jcrespo: mariadb: Fix bug on query killer regex [software] - 10https://gerrit.wikimedia.org/r/357644 [16:42:02] 10Operations, 10ops-esams, 10Traffic: Degraded RAID on lvs3001 - https://phabricator.wikimedia.org/T166965#3313139 (10RobH) warrany for lvs3001 ended on May 08, 2015 [16:43:14] 10Operations, 10Ops-Access-Requests: Request access to analytics-privatedata-users for GoranSMilovanovic - https://phabricator.wikimedia.org/T167116#3324654 (10Dzahn) a:03Dzahn [16:44:59] (03PS1) 10Elukey: Fix cassandra's jmx_prometheus_javaagent jar path for deployment-prep [puppet] - 10https://gerrit.wikimedia.org/r/357646 (https://phabricator.wikimedia.org/T167222) [16:45:12] urandom: --^ [16:47:00] (03CR) 10Eevans: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/357646 (https://phabricator.wikimedia.org/T167222) (owner: 10Elukey) [16:47:07] elukey: --^ :) [16:47:31] (03PS2) 10Jcrespo: mariadb: Fix bug on query killer regex [software] - 10https://gerrit.wikimedia.org/r/357644 [16:47:45] (03CR) 10Elukey: [C: 032] Fix cassandra's jmx_prometheus_javaagent jar path for deployment-prep [puppet] - 10https://gerrit.wikimedia.org/r/357646 (https://phabricator.wikimedia.org/T167222) (owner: 10Elukey) [16:53:38] urandom: mmmm nothing changes [16:54:02] !? [16:54:36] /etc/cassandra/cassandra-env.sh doesn't change after the puppet run [16:54:49] does it in restbase? [16:54:54] elukey: did the changeset not get merged on the puppet master or something? [16:55:01] elukey: no, it doesn't it [16:55:24] err... it doesn't [16:55:29] ah the deployment-prep puppet master! 
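The failure discussed above (cassandra refusing to start because the `-javaagent` jar named in its JVM options does not exist under that name) can be caught before a restart. The following is only a minimal sketch, assuming the options appear in the standard JVM form `-javaagent:<jar>[=<options>]` somewhere in a text file such as /etc/cassandra/cassandra-env.sh. The function names, the sample line, and the port `7800` in it are hypothetical and not taken from the log.

```python
import os
import re

# Matches the standard JVM option form -javaagent:<jarpath>[=<options>].
JAVAAGENT_RE = re.compile(r"-javaagent:([^=\s\"']+)")


def find_javaagent_jars(text):
    """Return every jar path referenced by a -javaagent option in text."""
    return JAVAAGENT_RE.findall(text)


def missing_javaagent_jars(text, exists=os.path.exists):
    """Return the referenced jar paths for which exists() is false."""
    return [jar for jar in find_javaagent_jars(text) if not exists(jar)]


if __name__ == "__main__":
    # Hypothetical sample line: the real cassandra-env.sh content is not
    # shown in the log, and the port 7800 is made up for illustration.
    sample = (
        'JVM_OPTS="$JVM_OPTS -javaagent:/srv/deployment/prometheus/'
        'jmx_exporter/lib/jmx_prometheus_javaagent-0.8-SNAPSHOT.jar'
        '=7800:/etc/cassandra/jmx_exporter.yaml"'
    )
    for jar in missing_javaagent_jars(sample):
        print("missing javaagent jar:", jar)
```

A check like this only verifies that the file exists. It would not detect a jar that is present but left as an unhydrated git-fat placeholder, which was the first guess in the conversation.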
[16:55:33] it needs to sync [16:55:59] it does it periodically [16:56:05] so in a bit the change should be ready [16:56:12] kk [16:58:04] 10Operations, 10Ops-Access-Requests, 10Patch-For-Review: Grant sudo access for Bryan Davis for labstore* and labsdb* - https://phabricator.wikimedia.org/T166310#3324730 (10chasemp) >>! In T166310#3324137, @chasemp wrote: > If we want to holdoff on the labsdb root inclusion I am going to propose in the opsen... [16:58:30] 10Operations, 10Ops-Access-Requests, 10Patch-For-Review: Grant root access for Bryan Davis on labstore* and admin for maintain scripts for labsdb* - https://phabricator.wikimedia.org/T166310#3324734 (10chasemp) [16:59:09] elukey: i have to go afk for about 10-15 mins, i'll try it again after [16:59:21] elukey: not sure how often it syncs... [16:59:42] ack! [17:01:49] 10Operations, 10ops-codfw, 10Labs, 10Labs-Infrastructure: rack/setup/install labtestpuppetmaster2001 - https://phabricator.wikimedia.org/T167157#3324762 (10Papaul) [17:01:52] 10Operations, 10ops-codfw, 10Labs, 10Labs-Infrastructure, 10netops: codfw: labtestpuppetmaster2001 switch port configuration - https://phabricator.wikimedia.org/T167321#3324750 (10Papaul) [17:01:59] RECOVERY - puppet last run on mw2152 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures [17:03:22] (03PS3) 10Elukey: beta: profile::cassandra::allow_analytics: false [puppet] - 10https://gerrit.wikimedia.org/r/357344 (owner: 10Hashar) [17:03:29] (03CR) 10Elukey: [C: 031] beta: profile::cassandra::allow_analytics: false [puppet] - 10https://gerrit.wikimedia.org/r/357344 (owner: 10Hashar) [17:04:29] PROBLEM - puppet last run on elastic1032 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:04:39] PROBLEM - dhclient process on thumbor1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:04:39] PROBLEM - nutcracker process on thumbor1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[17:04:49] (03PS4) 10Rush: WMCS: add access for bryan davis [puppet] - 10https://gerrit.wikimedia.org/r/355463 (https://phabricator.wikimedia.org/T166310) (owner: 10BryanDavis) [17:05:29] PROBLEM - salt-minion processes on thumbor1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:05:39] 10Operations, 10ops-codfw, 10Labs, 10Labs-Infrastructure: rack/setup/install labtestnet2002 - https://phabricator.wikimedia.org/T167159#3324790 (10Papaul) [17:05:42] 10Operations, 10ops-codfw, 10Labs, 10Labs-Infrastructure, 10netops: codfw:labtestnet2002 switch port configuration - https://phabricator.wikimedia.org/T167322#3324778 (10Papaul) [17:05:50] (03CR) 10jerkins-bot: [V: 04-1] WMCS: add access for bryan davis [puppet] - 10https://gerrit.wikimedia.org/r/355463 (https://phabricator.wikimedia.org/T166310) (owner: 10BryanDavis) [17:06:19] RECOVERY - salt-minion processes on thumbor1002 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [17:06:39] RECOVERY - nutcracker process on thumbor1002 is OK: PROCS OK: 1 process with UID = 115 (nutcracker), command name nutcracker [17:06:39] RECOVERY - dhclient process on thumbor1002 is OK: PROCS OK: 0 processes with command name dhclient [17:07:28] (03PS5) 10Rush: WMCS: add access for bryan davis [puppet] - 10https://gerrit.wikimedia.org/r/355463 (https://phabricator.wikimedia.org/T166310) (owner: 10BryanDavis) [17:08:21] on thumbor I can see Memory cgroup out of memory: Kill process 39063 (convert) score 439 or sacrifice child [17:10:51] gilles, godog --^ [17:11:26] elukey: those are expected, some images blow up memory usage, that's what the cgroup memory limits are for [17:12:05] ah good, thanks for the explanation, every time I run dmesg on thumbor it seems a mine field :) [17:12:34] 10Operations, 10ops-codfw, 10Labs, 10Labs-Infrastructure: rack/setup/install labtestneutron2002 - https://phabricator.wikimedia.org/T167160#3324856 (10Papaul) [17:12:36] nutcracker still shows the weird issue 
[17:12:38] 10Operations, 10ops-codfw, 10Labs, 10Labs-Infrastructure, 10netops: codfw: labtestneutron2002 sswitch port configuration - https://phabricator.wikimedia.org/T167326#3324843 (10Papaul) [17:12:44] (03CR) 10Jcrespo: [C: 031] WMCS: add access for bryan davis [puppet] - 10https://gerrit.wikimedia.org/r/355463 (https://phabricator.wikimedia.org/T166310) (owner: 10BryanDavis) [17:13:25] gilles: I can see a lot of connections in established to nutcracker [17:13:32] elukey: unfortunately most image conversion tools don't have memory/time limit options, so they can't fail gracefully [17:13:37] elukey: that's being addressed [17:13:48] ok super, I'll just restart nutcracker then [17:13:57] elukey: https://phabricator.wikimedia.org/rTHMBREXTc9b1461aec91df83dcc74126de32177844dead2c not deployed yet [17:14:30] c [17:14:35] I'm probably going to bundle it with something else before we deploy a package upgrade [17:14:59] !log restart nutcracker on thumbor1002 (too many connections approaching the 1024 ulimit) [17:15:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:15:39] back to the original issue after the restart of course [17:15:46] right [17:16:15] essentially it doesn't blow up that often because commands like convert get it to die regularly on broken files [17:16:29] 10Operations, 10Performance-Team, 10Reading-Infrastructure-Team-Backlog, 10Reading-Infrastructure-Team-Old (Don't use), and 5 others: Deploy TemplateStyles to WMF production - https://phabricator.wikimedia.org/T133410#3324871 (10Jdforrester-WMF) [17:16:38] (03PS6) 10Rush: WMCS: add access for bryan davis [puppet] - 10https://gerrit.wikimedia.org/r/355463 (https://phabricator.wikimedia.org/T166310) (owner: 10BryanDavis) [17:16:40] 10Operations, 10Performance-Team, 10Reading-Infrastructure-Team-Backlog, 10Reading-Infrastructure-Team-Old (Don't use), and 5 others: Deploy TemplateStyles to WMF production - https://phabricator.wikimedia.org/T133410#2231032 
(10Jdforrester-WMF) [17:19:59] (03PS2) 10Alexandros Kosiaris: puppet-compiler: Do the rsync using sudo [puppet] - 10https://gerrit.wikimedia.org/r/357619 [17:20:10] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] puppet-compiler: Do the rsync using sudo [puppet] - 10https://gerrit.wikimedia.org/r/357619 (owner: 10Alexandros Kosiaris) [17:20:50] * volans was about to +1, didn't had the time :) [17:22:20] PROBLEM - puppet last run on tegmen is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:25:29] RECOVERY - puppet last run on elastic1032 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures [17:32:27] urandom: there was a rebase problem on the deployment-prep puppet master, fixed it, let's restart the work :D [17:32:47] elukey: :) [17:33:03] all right finally it gets picked up :D [17:33:16] elukey: yup, i see it [17:34:22] urandom: org.apache.cassandra.db.UnknownColumnFamilyException: Got slice command for nonexistent table local_group_default_T_parsoid_html.data [17:34:27] lol [17:34:56] it is warning though [17:35:32] now the issue seems to be Caused by: java.io.FileNotFoundException: /etc/cassandra/jmx_exporter.yaml (No such file or directory) [17:37:06] ah! /etc/cassandra/prometheus_jmx_exporter.yaml [17:37:32] (03CR) 10Dzahn: [C: 031] WMCS: add access for bryan davis [puppet] - 10https://gerrit.wikimedia.org/r/355463 (https://phabricator.wikimedia.org/T166310) (owner: 10BryanDavis) [17:37:37] elukey: yeah :/ [17:37:55] the other warning is probably because the clusters are crossed [17:38:05] we maybe to to reinit them [17:38:13] wow. 
[17:38:21] we may have to reinit them, even [17:38:49] (03PS1) 10Elukey: Fix cassandra's jmx_prometheus_javaagent config path for deployment-prep [puppet] - 10https://gerrit.wikimedia.org/r/357649 (https://phabricator.wikimedia.org/T167222) [17:39:47] (03CR) 10Elukey: [V: 032 C: 032] Fix cassandra's jmx_prometheus_javaagent config path for deployment-prep [puppet] - 10https://gerrit.wikimedia.org/r/357649 (https://phabricator.wikimedia.org/T167222) (owner: 10Elukey) [17:39:55] (03PS2) 10Elukey: Fix cassandra's jmx_prometheus_javaagent config path for deployment-prep [puppet] - 10https://gerrit.wikimedia.org/r/357649 (https://phabricator.wikimedia.org/T167222) [17:40:14] (03CR) 10Elukey: [V: 032 C: 032] Fix cassandra's jmx_prometheus_javaagent config path for deployment-prep [puppet] - 10https://gerrit.wikimedia.org/r/357649 (https://phabricator.wikimedia.org/T167222) (owner: 10Elukey) [17:40:50] 10Operations, 10Mail: Increase email log retention period for the main email relays - https://phabricator.wikimedia.org/T167333#3325007 (10faidon) [17:51:24] urandom: now it finally works :D [17:51:29] running puppet on all the aqs nodes [17:52:09] (03PS1) 10Dzahn: admins: add Goran Milovanovic to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/357650 (https://phabricator.wikimedia.org/T167116) [17:52:23] (03PS3) 10Krinkle: Enable wgUsejQueryThree in beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/348120 (https://phabricator.wikimedia.org/T124742) [17:52:25] (03PS1) 10Andrew Bogott: Glance: Update our keystone_authtoken section [puppet] - 10https://gerrit.wikimedia.org/r/357651 (https://phabricator.wikimedia.org/T165211) [17:52:32] (03CR) 10jerkins-bot: [V: 04-1] Enable wgUsejQueryThree in beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/348120 (https://phabricator.wikimedia.org/T124742) (owner: 10Krinkle) [17:53:40] (03CR) 10jerkins-bot: [V: 04-1] Glance: Update our keystone_authtoken section [puppet] - 
10https://gerrit.wikimedia.org/r/357651 (https://phabricator.wikimedia.org/T165211) (owner: 10Andrew Bogott) [17:54:19] RECOVERY - puppet last run on tegmen is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures [17:54:39] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [50.0] [17:55:12] (03PS4) 10Krinkle: Enable wgUsejQueryThree on the Beta Cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/348120 (https://phabricator.wikimedia.org/T124742) [17:55:21] (03CR) 10jerkins-bot: [V: 04-1] Enable wgUsejQueryThree on the Beta Cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/348120 (https://phabricator.wikimedia.org/T124742) (owner: 10Krinkle) [17:55:47] (03CR) 10Krinkle: "With all but one blocker for T124742 resolved, and no new ones discovered for a while, I think we're ready for a wikitech-l announcement a" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/348120 (https://phabricator.wikimedia.org/T124742) (owner: 10Krinkle) [17:56:39] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 70.00% above the threshold [25.0] [17:57:07] (03PS5) 10Krinkle: Enable wgUsejQueryThree on the Beta Cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/348120 (https://phabricator.wikimedia.org/T124742) [17:57:37] urandom: weird enough nodetool status shows up restbase nodes [18:00:04] addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Respected human, time to deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170607T1800). Please do the needful. [18:00:05] jan_drewniak: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. Please be available during the process. 
[18:00:38] (03PS2) 10Andrew Bogott: Glance: Update our keystone_authtoken section [puppet] - 10https://gerrit.wikimedia.org/r/357651 (https://phabricator.wikimedia.org/T165211) [18:00:40] (03PS1) 10Andrew Bogott: Glance: Remove a glance config file [puppet] - 10https://gerrit.wikimedia.org/r/357652 [18:02:28] (03CR) 10Andrew Bogott: [C: 032] Glance: Remove a glance config file [puppet] - 10https://gerrit.wikimedia.org/r/357652 (owner: 10Andrew Bogott) [18:02:52] elukey: yeah :( [18:02:53] is SWAT happening? [18:03:33] SMalyshev: yes [18:04:14] (03PS4) 10Aaron Schulz: Set cron script to dump MediaWiki DB lag times into statsd [puppet] - 10https://gerrit.wikimedia.org/r/354138 (https://phabricator.wikimedia.org/T149210) [18:04:24] (03CR) 10Andrew Bogott: [C: 032] Glance: Update our keystone_authtoken section [puppet] - 10https://gerrit.wikimedia.org/r/357651 (https://phabricator.wikimedia.org/T165211) (owner: 10Andrew Bogott) [18:05:06] I can SWAT [18:05:16] jan_drewniak: ping for SWAT [18:05:21] 10Operations, 10Ops-Access-Requests, 10Patch-For-Review: Grant root access for Bryan Davis on labstore* and admin for maintain scripts for labsdb* - https://phabricator.wikimedia.org/T166310#3292124 (10Dzahn) This proposal by chasemp right above has been approved in ops meeting today. 
[18:05:29] o/ [18:05:46] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/357490 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [18:05:52] (03PS7) 10Dzahn: WMCS: add access for bryan davis [puppet] - 10https://gerrit.wikimedia.org/r/355463 (https://phabricator.wikimedia.org/T166310) (owner: 10BryanDavis) [18:06:12] thcipriani: present o/ [18:06:20] jan_drewniak: hello :) [18:07:07] (03Merged) 10jenkins-bot: Updating portals stats [mediawiki-config] - 10https://gerrit.wikimedia.org/r/357490 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [18:07:17] (03CR) 10jenkins-bot: Updating portals stats [mediawiki-config] - 10https://gerrit.wikimedia.org/r/357490 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [18:07:19] (03PS4) 10Thcipriani: Enable archive indexing on delete for select wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/357236 (https://phabricator.wikimedia.org/T162302) (owner: 10Smalyshev) [18:07:58] (03CR) 10Dzahn: [C: 032] "approved in ops meeting, +1 from jynus too... merging then" [puppet] - 10https://gerrit.wikimedia.org/r/355463 (https://phabricator.wikimedia.org/T166310) (owner: 10BryanDavis) [18:08:02] urandom: updated the task with all the thing that I did, but now I gotta go :( Will restart working on it tomorrow! [18:08:03] anomie: looks like I got a beta patch of yours, I'll take care of it [18:08:36] jan_drewniak: portals update is live on mwdebug1002, check please [18:09:41] thcipriani: Is that https://gerrit.wikimedia.org/r/#/c/357633/ you're referring to? [18:10:17] (03CR) 10Dzahn: "on labstore1003: User[bd808]/User[bd808]/ensure: created ... etc" [puppet] - 10https://gerrit.wikimedia.org/r/355463 (https://phabricator.wikimedia.org/T166310) (owner: 10BryanDavis) [18:10:40] anomie: looks like this one, all labs changes nbd: https://gerrit.wikimedia.org/r/#/c/357629/ just a surprising fetch is all :) [18:11:02] thcipriani: looks good! 
[18:11:29] thcipriani: Ah, Gergő must have not deployed it after he merged it. If you don't mind though, https://gerrit.wikimedia.org/r/#/c/357629/ needs merging so it actually works in Beta Labs. [18:12:00] (03PS1) 10Andrew Bogott: Glance config: replace <%= @keystoneconfig["ldap_user_dn"] %> with 'novaadmin' [puppet] - 10https://gerrit.wikimedia.org/r/357653 [18:13:22] !log thcipriani@tin Synchronized portals/prod/wikipedia.org/assets: SWAT: [[gerrit:357490|Updating portals stats]] T128546 (duration: 00m 44s) [18:13:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:13:35] T128546: [Recurring Task] Update Wikipedia.org Portal and sister Wiki's statistics - https://phabricator.wikimedia.org/T128546 [18:13:43] (03CR) 10Andrew Bogott: [C: 032] Glance config: replace <%= @keystoneconfig["ldap_user_dn"] %> with 'novaadmin' [puppet] - 10https://gerrit.wikimedia.org/r/357653 (owner: 10Andrew Bogott) [18:14:07] !log thcipriani@tin Synchronized portals: SWAT: [[gerrit:357490|Updating portals stats]] T128546 (duration: 00m 44s) [18:14:15] ^ jan_drewniak live everywhere [18:14:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:14:33] anomie: sorry, what do you mean? https://gerrit.wikimedia.org/r/#/c/357629/ is merged [18:14:50] thcipriani: Sorry, copy-pasted the wrong URL. 
https://gerrit.wikimedia.org/r/#/c/357633/ [18:14:55] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/357236 (https://phabricator.wikimedia.org/T162302) (owner: 10Smalyshev) [18:15:46] 10Operations, 10Traffic: Network hardware purchasing for Asia Cache DC - https://phabricator.wikimedia.org/T162683#3325171 (10faidon) [18:15:52] (03Merged) 10jenkins-bot: Enable archive indexing on delete for select wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/357236 (https://phabricator.wikimedia.org/T162302) (owner: 10Smalyshev) [18:16:00] (03CR) 10jenkins-bot: Enable archive indexing on delete for select wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/357236 (https://phabricator.wikimedia.org/T162302) (owner: 10Smalyshev) [18:16:07] thcipriani: thanks! [18:16:09] 10Operations, 10Ops-Access-Requests, 10Patch-For-Review: Grant root access for Bryan Davis on labstore* and admin for maintain scripts for labsdb* - https://phabricator.wikimedia.org/T166310#3325175 (10Dzahn) I confirmed user and group wmcs-roots has been created on: labstore1003 (for the root on labstore*... [18:17:00] (03PS4) 10Bearloga: Add Shiny Server module and Discovery Dashboards role/profile [puppet] - 10https://gerrit.wikimedia.org/r/353571 (https://phabricator.wikimedia.org/T161354) [18:17:07] (03CR) 10Bearloga: Add Shiny Server module and Discovery Dashboards role/profile (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/353571 (https://phabricator.wikimedia.org/T161354) (owner: 10Bearloga) [18:17:12] thcipriani: thanks! 
[18:17:31] anomie: I may leave that one for folks who may have informed opinions about vendor master -- looks fine to me, but may need more review than I can give/unsure :/ [18:17:46] ok, thanks anyway [18:17:50] 10Operations, 10Ops-Access-Requests, 10Patch-For-Review: Grant root access for Bryan Davis on labstore* and admin for maintain scripts for labsdb* - https://phabricator.wikimedia.org/T166310#3325182 (10Dzahn) 05Open>03Resolved a:03Dzahn @bd808 ^ this should be resolved now. all other hosts should work... [18:18:25] mutante: thanks :) [18:19:11] SMalyshev: your change is live on mwdebug1002, check please [18:19:16] bd808: welcome:) [18:20:02] testing [18:21:08] thcipriani: everything seems to be working fine, thanks! [18:21:20] SMalyshev: cool, syncing everywhere now [18:22:22] !log thcipriani@tin Synchronized wmf-config: SWAT: [[gerrit:357236|Enable archive indexing on delete for select wikis]] T162302 (duration: 00m 47s) [18:22:28] ^ SMalyshev should be live now [18:22:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:22:32] T162302: Add archive index to wikis - https://phabricator.wikimedia.org/T162302 [18:31:39] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 70.00% of data above the critical threshold [50.0] [18:32:39] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 70.00% above the threshold [25.0] [18:36:12] Hello ! is it possible to deploy two patchs as part of the current swat window not yet planned on wikitech table ? 
[18:37:44] it's two IP throttle configurations, https://gerrit.wikimedia.org/r/357233 & https://gerrit.wikimedia.org/r/357510 [18:38:35] (03PS1) 10Andrew Bogott: Designate: further cleanup of keystone_authtoken config [puppet] - 10https://gerrit.wikimedia.org/r/357658 (https://phabricator.wikimedia.org/T165211) [18:38:37] (03PS1) 10Andrew Bogott: Remove references to keystone admin_token [puppet] - 10https://gerrit.wikimedia.org/r/357659 (https://phabricator.wikimedia.org/T165211) [18:48:50] (03CR) 10Jcrespo: [C: 032] "This seems to be working on enwiki API servers- will deploy fleet wide if we do not see anything weird." [software] - 10https://gerrit.wikimedia.org/r/357644 (owner: 10Jcrespo) [18:49:55] (03PS2) 10Faidon Liambotis: Switch stretch to predictable network interface names [puppet] - 10https://gerrit.wikimedia.org/r/357609 (https://phabricator.wikimedia.org/T158429) [18:50:05] (03CR) 10Faidon Liambotis: [C: 032] Switch stretch to predictable network interface names [puppet] - 10https://gerrit.wikimedia.org/r/357609 (https://phabricator.wikimedia.org/T158429) (owner: 10Faidon Liambotis) [19:00:04] twentyafterfour: Dear anthropoid, the time has come. Please deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170607T1900). 
[19:20:39] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 90.00% of data above the critical threshold [50.0] [19:29:39] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 70.00% above the threshold [25.0] [19:42:01] (03PS1) 10Jdlrobson: Relaunch related pages A/B test to 98% of users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/357666 (https://phabricator.wikimedia.org/T167310) [20:00:04] gwicke, cscott, arlolra, subbu, bearND, halfak, and Amir1: Respected human, time to deploy Services – Parsoid / OCG / Citoid / Mobileapps / ORES / … (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170607T2000). Please do the needful. [20:00:13] (03CR) 10Andrew Bogott: [C: 032] Designate: further cleanup of keystone_authtoken config [puppet] - 10https://gerrit.wikimedia.org/r/357658 (https://phabricator.wikimedia.org/T165211) (owner: 10Andrew Bogott) [20:00:16] no parsoid deploy today [20:00:19] (03PS2) 10Andrew Bogott: Designate: further cleanup of keystone_authtoken config [puppet] - 10https://gerrit.wikimedia.org/r/357658 (https://phabricator.wikimedia.org/T165211) [20:04:01] !log Preparing to deploy the MediaWiki train for group1 wikis, 1.30.0-wmf.4 refs T166829 [20:04:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:04:10] T166829: MW-1.30.0-wmf.4 deployment blockers - https://phabricator.wikimedia.org/T166829 [20:05:10] MatmaRex: can you tell me the status of T167042 and T167216 ? I saw patches but the tasks are still open and blocking [20:05:10] T167216: $wgResourceModuleSkinStyles customizations are not being applied for Vector because MobileFrontend accidentally overrides them - https://phabricator.wikimedia.org/T167216 [20:05:11] T167042: [Regression pre-wmf.4] Vector's personal tools are appearing above OOjs UI's global overlay ... 
again - https://phabricator.wikimedia.org/T167042 [20:05:35] Just want to make sure before I go ahead with wmf.4 [20:05:59] 10Operations, 10ops-codfw, 10Labs, 10Labs-Infrastructure: rack/setup/install labtestnet2002 - https://phabricator.wikimedia.org/T167159#3330192 (10RobH) [20:06:02] 10Operations, 10ops-codfw, 10Labs, 10Labs-Infrastructure, 10netops: codfw:labtestnet2002 switch port configuration - https://phabricator.wikimedia.org/T167322#3330189 (10RobH) 05Open>03Resolved a:03RobH Done! [20:06:21] 10Operations, 10ops-codfw, 10Labs, 10Labs-Infrastructure: rack/setup/install labtestneutron2002 - https://phabricator.wikimedia.org/T167160#3330193 (10RobH) [20:06:30] 10Operations, 10ops-codfw, 10Labs, 10Labs-Infrastructure: rack/setup/install labtestnet2002 - https://phabricator.wikimedia.org/T167159#3319490 (10RobH) [20:06:40] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 70.00% of data above the critical threshold [50.0] [20:11:53] (03PS2) 10Andrew Bogott: Remove references to keystone admin_token [puppet] - 10https://gerrit.wikimedia.org/r/357659 (https://phabricator.wikimedia.org/T165211) [20:11:55] (03PS1) 10Andrew Bogott: Keystone: add keystone-paste.ini upstream files. [puppet] - 10https://gerrit.wikimedia.org/r/357672 (https://phabricator.wikimedia.org/T165211) [20:15:54] (03PS2) 10Andrew Bogott: Keystone: add keystone-paste.ini upstream files. [puppet] - 10https://gerrit.wikimedia.org/r/357672 (https://phabricator.wikimedia.org/T165211) [20:15:56] (03PS3) 10Andrew Bogott: Remove references to keystone admin_token [puppet] - 10https://gerrit.wikimedia.org/r/357659 (https://phabricator.wikimedia.org/T165211) [20:18:27] (03CR) 10Andrew Bogott: [C: 032] Keystone: add keystone-paste.ini upstream files. 
[puppet] - 10https://gerrit.wikimedia.org/r/357672 (https://phabricator.wikimedia.org/T165211) (owner: 10Andrew Bogott) [20:19:42] jynus: ping [20:22:29] PROBLEM - puppet last run on labcontrol1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 1 minute ago with 1 failures. Failed resources (up to 3 shown): File[/etc/keystone/keystone-api.ini] [20:22:51] (03CR) 10RobH: [C: 031] admins: add Goran Milovanovic to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/357650 (https://phabricator.wikimedia.org/T167116) (owner: 10Dzahn) [20:23:18] 10Operations, 10DBA, 10Wikimedia-Site-requests: Rename user "Mlpearc" to "FlightTime" on Central Auth - https://phabricator.wikimedia.org/T166028#3330283 (10MarcoAurelio) I have time to do it now if it is a good moment @jcrespo / @Marostegui. [20:24:16] (03PS2) 10Dzahn: admins: add Goran Milovanovic to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/357650 (https://phabricator.wikimedia.org/T167116) [20:24:36] (03CR) 10Andrew Bogott: [C: 032] Keystone paste: typo fix [puppet] - 10https://gerrit.wikimedia.org/r/357678 (owner: 10Andrew Bogott) [20:26:14] !log twentyafterfour@tin Synchronized php-1.30.0-wmf.4/extensions/MobileFrontend: Deploy 66ef9cbd7f3de2832154f97392c2418fb1cd56ec refs T167216 (duration: 00m 46s) [20:26:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:26:21] T167216: $wgResourceModuleSkinStyles customizations are not being applied for Vector because MobileFrontend accidentally overrides them - https://phabricator.wikimedia.org/T167216 [20:26:29] RECOVERY - puppet last run on labcontrol1001 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [20:27:03] (03CR) 10Andrew Bogott: "I've verified on labtestcontrol2001 that this doesn't seem to break anything, at least anything to do with creating an instance via Horizo" [puppet] - 10https://gerrit.wikimedia.org/r/357659 
(https://phabricator.wikimedia.org/T165211) (owner: 10Andrew Bogott) [20:27:20] (03CR) 10Dzahn: [C: 032] admins: add Goran Milovanovic to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/357650 (https://phabricator.wikimedia.org/T167116) (owner: 10Dzahn) [20:27:26] (03PS3) 10Dzahn: admins: add Goran Milovanovic to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/357650 (https://phabricator.wikimedia.org/T167116) [20:27:54] (03CR) 10Andrew Bogott: "@hashar, can you confirm that nodepool uses username/password auth and not the admin_token to talk to openstack services?" [puppet] - 10https://gerrit.wikimedia.org/r/357659 (https://phabricator.wikimedia.org/T165211) (owner: 10Andrew Bogott) [20:28:29] PROBLEM - puppet last run on labservices1002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[/usr/local/bin/labs-ip-alias-dump.py] [20:36:31] 10Operations, 10Performance-Team, 10Reading-Infrastructure-Team-Backlog, 10Reading-Infrastructure-Team-Old (Don't use), and 5 others: Deploy TemplateStyles to WMF production - https://phabricator.wikimedia.org/T133410#3330316 (10Anomie) [20:40:56] (03CR) 10GoranSMilovanovic: "+1" [puppet] - 10https://gerrit.wikimedia.org/r/357650 (https://phabricator.wikimedia.org/T167116) (owner: 10Dzahn) [20:41:33] I'm getting this message on test.wikipedia.org: Sorry! This site is experiencing technical difficulties. [20:41:34] Try waiting a few minutes and reloading. [20:41:34] (Cannot access the database: Cannot access the database: Access denied for user 'wikiuser'@'10.64.%' to database '0' (10.64.16.24)) [20:42:06] dannyh: works for me... [20:42:41] transient failure? Does it work if you refresh the page? 
[20:42:51] twentyafterfour sorry, it's when I try to login [20:43:17] loads the pages fine, it just fails on login [20:43:59] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 214, down: 0, dormant: 0, excluded: 0, unused: 0 [20:44:17] dannyh: I'm able to log in also [20:44:22] so I'm not sure what's going on [20:44:49] twentyafterfour hmm, weird [20:44:56] I'll see if I can get someone else to reproduce [20:48:43] * greg-g tries [20:49:12] (03PS1) 1020after4: group1 wikis to 1.30.0-wmf.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/357704 [20:49:14] (03CR) 1020after4: [C: 032] group1 wikis to 1.30.0-wmf.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/357704 (owner: 1020after4) [20:49:45] it may be LoginNotify -- I'm trying to test logging in with the wrong password [20:50:33] dannyh: works for me, oh, I'll try with wrong password [20:51:12] "Incorrect password entered. Please try again. [20:51:32] (03Merged) 10jenkins-bot: group1 wikis to 1.30.0-wmf.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/357704 (owner: 1020after4) [20:51:39] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 70.00% above the threshold [25.0] [20:51:43] (03CR) 10jenkins-bot: group1 wikis to 1.30.0-wmf.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/357704 (owner: 1020after4) [20:52:24] (03CR) 10Jforrester: [C: 04-1] "Happy to get it landed, but let's wait until at least we've had a train go out successfully in the past three weeks. Right now it'd just a" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/348120 (https://phabricator.wikimedia.org/T124742) (owner: 10Krinkle) [20:52:30] greg-g: I'm going ahead with group1 since the only remaining blocker (T167343) appears to be non-critical, at least for now [20:52:30] T167343: Unexpected general module "mediawiki.widgets.DateInputWidget" in styles queue. 
- https://phabricator.wikimedia.org/T167343 [20:53:11] twentyafterfour: It will flood logs, however. [20:53:15] A fix just landed. [20:53:19] Can be backported. [20:53:23] Krinkle: ok [20:53:45] I already cherry-picked the patch but I was waiting for someone who knows something about it to +1 at least [20:54:40] Krinkle: also, that patch doesn't look like it would resolve the log flooding I'm seeing in kibana... but I'm really not sure I understand what's going on with resource loader [20:54:41] twentyafterfour greg-g test wiki's working for me now, Max is looking at a bug -- sorry to bother you [20:55:12] twentyafterfour: Assuming we're talking about https://gerrit.wikimedia.org/r/#/c/357668/ - yes, it will fix T167343 [20:56:18] Krinkle: yeah that's the one [20:56:21] The rule in RL is that when a module exists, you can load it on a page. You can't load just some portion of it. Because css and js are different, styles-only modules can be loaded into a or using OutputPage::addModuleStyles(), and for dynamic JS stuff (which can also have css with it) you enqueue the module with OutputPage::addModules(). [20:57:07] If you call addModuleStyles() with a module that is a dynamic JS module, it will end up loading just the styles part of it, which is an error since th en that moduele should get marked as "loaded", but isn't really. This will be an exceptoin, once people stop re-introducing errors :) [20:57:39] RECOVERY - puppet last run on labservices1002 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures [20:57:54] so this will fix all the various "Unexpected general module "X" in styles queue." errors I'm seeing? [21:00:57] twentyafterfour: I'll fix the ones about "mediawiki.widgets.DateInputWidget" - which you're seeing mostly on test/test2wiki right now. [21:01:24] Which would've become a new warning, more common then all the others right now. 
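The ResourceLoader rule explained above can be modelled in a few lines. This is a toy Python sketch, not MediaWiki code: the class `OutputPageSketch`, its methods, and the module name `example.styles` are hypothetical stand-ins for `OutputPage::addModules()` and `OutputPage::addModuleStyles()`. It assumes, as described in the conversation, that queuing a general (JS-capable) module in the styles queue currently only produces a warning and that the styles are still loaded.

```python
class OutputPageSketch:
    """Toy stand-in for the two queues discussed above; not MediaWiki code."""

    def __init__(self, styles_only_modules):
        # Names of modules registered as styles-only (hypothetical registry).
        self.styles_only = set(styles_only_modules)
        self.module_queue = []
        self.style_queue = []
        self.warnings = []

    def add_modules(self, name):
        """Queue a general module (JS, possibly with CSS attached)."""
        self.module_queue.append(name)

    def add_module_styles(self, name):
        """Queue a module for styles-only loading, warning on misuse."""
        if name not in self.styles_only:
            # Same wording as the warning quoted in the log.
            self.warnings.append(
                'Unexpected general module "%s" in styles queue.' % name
            )
        # Per the discussion, the styles part is still loaded for now.
        self.style_queue.append(name)


if __name__ == "__main__":
    page = OutputPageSketch(styles_only_modules=["example.styles"])
    page.add_module_styles("example.styles")
    page.add_module_styles("mediawiki.widgets.DateInputWidget")
    for warning in page.warnings:
        print(warning)
```

Turning the warning into an exception, as proposed in the conversation, would amount to raising at the point where this sketch appends to `warnings`.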
[21:01:45] I'm seeing thousands of others though, 121,038 in 15 minutes [21:01:58] Yes, but wmf.4 is only on group0. [21:02:02] This is a regression. [21:02:05] Hence I made it a blocker :) [21:02:11] ok [21:02:17] Lower traffic on group0. [21:02:34] right I understand that part, just not sure why all these others aren't also blockers? [21:02:39] just because they are pre-existing? [21:03:01] twentyafterfour: Pre-existing indeed, and because they are a relatively small portion of traffic only. [21:03:10] these warnings have been around for years without logging. [21:03:28] I enabled logging for them once a month ago, then immediately disabled. Identified the 99%, had them fixed and then re-enabled. [21:04:04] since then we're working off the list, only 2 left (Math, and ProofreadPage). Math is fixed in wmf.4 I think. [21:04:07] (03PS5) 10BBlack: maps->upload: delete maps-specific things [puppet] - 10https://gerrit.wikimedia.org/r/352834 (https://phabricator.wikimedia.org/T164608) [21:04:08] And there is this new regression. [21:04:13] I support the plan to make them exceptions [21:04:29] which would've meant we had to disable logging again if we were to tolerate this regression. 
[21:04:29] due to it being too common [21:05:59] (03CR) 10BBlack: [C: 032] maps->upload: delete maps-specific things [puppet] - 10https://gerrit.wikimedia.org/r/352834 (https://phabricator.wikimedia.org/T164608) (owner: 10BBlack) [21:06:20] waiting for jenkins to merge the cherry-picked patch, then I'll resume deploying to group 1 [21:08:12] (03PS1) 10Faidon Liambotis: Fix whitespace-related Rubocop warnings across the tree [puppet] - 10https://gerrit.wikimedia.org/r/357715 [21:08:14] (03PS1) 10Faidon Liambotis: check_puppetrun: fix rubocop warnings [puppet] - 10https://gerrit.wikimedia.org/r/357716 [21:08:16] (03PS1) 10Faidon Liambotis: wmflib: fix rubocop warnings [puppet] - 10https://gerrit.wikimedia.org/r/357717 [21:08:18] (03PS1) 10Faidon Liambotis: puppetmaster: fix rubocop warnings [puppet] - 10https://gerrit.wikimedia.org/r/357718 [21:08:20] (03PS1) 10Faidon Liambotis: hiera_lookup: fix rubocop warnings [puppet] - 10https://gerrit.wikimedia.org/r/357719 [21:08:22] (03PS1) 10Faidon Liambotis: scap: fix rubocop warnings [puppet] - 10https://gerrit.wikimedia.org/r/357720 [21:08:24] (03PS1) 10Faidon Liambotis: rubocop: add smokeping.fcgi to exclusion list [puppet] - 10https://gerrit.wikimedia.org/r/357721 [21:08:26] (03PS1) 10Faidon Liambotis: rubocop: update rubocop_todo for rubocop 0.48 [puppet] - 10https://gerrit.wikimedia.org/r/357722 [21:08:35] aka "anything but annual reviews" [21:09:59] (03CR) 10jerkins-bot: [V: 04-1] puppetmaster: fix rubocop warnings [puppet] - 10https://gerrit.wikimedia.org/r/357718 (owner: 10Faidon Liambotis) [21:10:12] (03CR) 10jerkins-bot: [V: 04-1] wmflib: fix rubocop warnings [puppet] - 10https://gerrit.wikimedia.org/r/357717 (owner: 10Faidon Liambotis) [21:10:36] (03CR) 10jerkins-bot: [V: 04-1] hiera_lookup: fix rubocop warnings [puppet] - 10https://gerrit.wikimedia.org/r/357719 (owner: 10Faidon Liambotis) [21:10:47] (03CR) 10jerkins-bot: [V: 04-1] rubocop: add smokeping.fcgi to exclusion list [puppet] - 
10https://gerrit.wikimedia.org/r/357721 (owner: 10Faidon Liambotis) [21:11:01] (03CR) 10jerkins-bot: [V: 04-1] scap: fix rubocop warnings [puppet] - 10https://gerrit.wikimedia.org/r/357720 (owner: 10Faidon Liambotis) [21:11:05] oh fuck off [21:11:29] :-/ [21:11:31] paravoid: word (re procrastination) [21:11:31] (03CR) 10jerkins-bot: [V: 04-1] rubocop: update rubocop_todo for rubocop 0.48 [puppet] - 10https://gerrit.wikimedia.org/r/357722 (owner: 10Faidon Liambotis) [21:12:40] twentyafterfour: Thanks [21:13:59] (03PS2) 10Faidon Liambotis: wmflib: fix rubocop warnings [puppet] - 10https://gerrit.wikimedia.org/r/357717 [21:14:01] (03PS2) 10Faidon Liambotis: puppetmaster: fix rubocop warnings [puppet] - 10https://gerrit.wikimedia.org/r/357718 [21:14:03] (03PS2) 10Faidon Liambotis: hiera_lookup: fix rubocop warnings [puppet] - 10https://gerrit.wikimedia.org/r/357719 [21:14:05] (03PS2) 10Faidon Liambotis: scap: fix rubocop warnings [puppet] - 10https://gerrit.wikimedia.org/r/357720 [21:14:07] (03PS2) 10Faidon Liambotis: rubocop: add smokeping.fcgi to exclusion list [puppet] - 10https://gerrit.wikimedia.org/r/357721 [21:14:09] (03PS2) 10Faidon Liambotis: rubocop: update rubocop_todo for rubocop 0.48 [puppet] - 10https://gerrit.wikimedia.org/r/357722 [21:16:10] (03PS1) 10Herron: Adding logrotate template to set mail::mx exim log retention to 60 days [puppet] - 10https://gerrit.wikimedia.org/r/357723 [21:16:25] (03CR) 10jerkins-bot: [V: 04-1] scap: fix rubocop warnings [puppet] - 10https://gerrit.wikimedia.org/r/357720 (owner: 10Faidon Liambotis) [21:16:27] Krinkle: thanks to you for educating me :) [21:16:47] (03CR) 10jerkins-bot: [V: 04-1] rubocop: add smokeping.fcgi to exclusion list [puppet] - 10https://gerrit.wikimedia.org/r/357721 (owner: 10Faidon Liambotis) [21:17:59] PROBLEM - puppet last run on kafka1014 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [21:18:06] (03CR) 10jerkins-bot: [V: 04-1] rubocop: update rubocop_todo for rubocop 0.48 [puppet] - 10https://gerrit.wikimedia.org/r/357722 (owner: 10Faidon Liambotis) [21:18:37] (03PS2) 10BBlack: lvs_service_ips: remove old maps IPs [puppet] - 10https://gerrit.wikimedia.org/r/357635 (https://phabricator.wikimedia.org/T164608) [21:19:06] (03PS3) 10Faidon Liambotis: scap: fix rubocop warnings [puppet] - 10https://gerrit.wikimedia.org/r/357720 [21:19:08] (03PS3) 10Faidon Liambotis: rubocop: add smokeping.fcgi to exclusion list [puppet] - 10https://gerrit.wikimedia.org/r/357721 [21:19:22] (03PS3) 10Faidon Liambotis: rubocop: update rubocop_todo for rubocop 0.48 [puppet] - 10https://gerrit.wikimedia.org/r/357722 [21:19:55] (03CR) 10BBlack: [V: 032 C: 032] lvs_service_ips: remove old maps IPs [puppet] - 10https://gerrit.wikimedia.org/r/357635 (https://phabricator.wikimedia.org/T164608) (owner: 10BBlack) [21:21:10] (03Restored) 10Faidon Liambotis: Puppet compiler for Tim's redirects.dat DSL [puppet] - 10https://gerrit.wikimedia.org/r/138292 (owner: 10Ori.livneh) [21:21:21] (03CR) 10jerkins-bot: [V: 04-1] rubocop: update rubocop_todo for rubocop 0.48 [puppet] - 10https://gerrit.wikimedia.org/r/357722 (owner: 10Faidon Liambotis) [21:22:38] (03CR) 10BBlack: [C: 032] remove maps geoip definition [dns] - 10https://gerrit.wikimedia.org/r/357632 (https://phabricator.wikimedia.org/T164608) (owner: 10BBlack) [21:22:48] (03CR) 10BBlack: [C: 032] remove old maps reverse dns [dns] - 10https://gerrit.wikimedia.org/r/357634 (https://phabricator.wikimedia.org/T164608) (owner: 10BBlack) [21:23:34] !log twentyafterfour@tin Synchronized php-1.30.0-wmf.4/: sync 3248a17f5529ab6d5cbabb75dd0e72a7589f8633 refs T167343 (duration: 07m 52s) [21:23:38] PROBLEM - puppet last run on kafka1022 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [21:23:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:23:45] T167343: Unexpected general module "mediawiki.widgets.DateInputWidget" in styles queue. - https://phabricator.wikimedia.org/T167343 [21:25:22] (03PS8) 10Faidon Liambotis: Puppet compiler for Tim's redirects.dat DSL [puppet] - 10https://gerrit.wikimedia.org/r/138292 (owner: 10Ori.livneh) [21:25:31] !log twentyafterfour@tin rebuilt wikiversions.php and synchronized wikiversions files: group1 wikis to 1.30.0-wmf.4 [21:25:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:25:47] hahaha this is amazing [21:26:07] I somehow reminded myself of this old commit by ori and wanted to clean it up and commit it [21:26:10] its date? [21:26:13] 2014-06-08 [21:26:14] 10Operations, 10Mail: Increase email log retention period for the main email relays - https://phabricator.wikimedia.org/T167333#3330487 (10herron) Change https://gerrit.wikimedia.org/r/#/c/357723 to increase exim log retention to 60 days is ready for review [21:26:18] that's crazy! [21:26:23] :) [21:26:38] PROBLEM - puppet last run on kafka1018 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:27:21] ah those kafka puppet errors are probably me [21:28:55] (03PS1) 10BBlack: cache_maps: remove ipsec defs on kafka nodes [puppet] - 10https://gerrit.wikimedia.org/r/357725 (https://phabricator.wikimedia.org/T164608) [21:29:18] Catchable fatal error: Argument 2 passed to RevisionSliderHooks::onDiffViewHeader() must be an instance of Revision, null given in /srv/mediawiki/php-1.30.0-wmf.4/includes/Hooks.php on line 186 [21:29:42] Warning: in_array() expects parameter 2 to be an array or collection in /srv/mediawiki/php-1.30.0-wmf.4/extensions/Wikidata/extensions/Wikibase/lib/includes/Formatters/MwTimeIsoFormatter.php on line 109 [21:29:47] have you just updated mediawiki? 
because popups no longer work for me :/ [21:29:48] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 90.00% of data above the critical threshold [50.0] [21:29:57] TabbyCat: yes [21:30:03] (03CR) 10BBlack: [V: 032 C: 032] cache_maps: remove ipsec defs on kafka nodes [puppet] - 10https://gerrit.wikimedia.org/r/357725 (https://phabricator.wikimedia.org/T164608) (owner: 10BBlack) [21:30:10] I think I should roll back [21:31:05] !log rolling back to wmf.2 due to error spike and popups no longer working refs T166829 [21:31:11] (03PS1) 1020after4: group1 wikis to 1.30.0-wmf.2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/357727 [21:31:13] (03CR) 1020after4: [C: 032] group1 wikis to 1.30.0-wmf.2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/357727 (owner: 1020after4) [21:31:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:31:16] T166829: MW-1.30.0-wmf.4 deployment blockers - https://phabricator.wikimedia.org/T166829 [21:31:38] RECOVERY - puppet last run on kafka1022 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures [21:31:38] RECOVERY - puppet last run on kafka1018 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures [21:31:49] yeah pending from icinga too (but not 3/3 to alert yet): PYBAL CRITICAL - api-https_443 - Could not depool server mw1206.eqiad.wmnet because of too many down!: swift-https_443 - Could not depool server ms-fe1008.eqiad.wmnet because of too many down!: appservers-https_443 - Could not depool server mw1172.eqiad.wmnet because of too many down!: wdqs_80 - Could not depool server wdqs1003.eqiad.wmn [21:31:55] et because of too many down!: rendering-https_443 - Could not depool server mw1298.eqiad.wmnet because of too many down!: thumbor_8800 - Could not depool server thumbor1001.eqiad.wmnet because of too many down! 
[21:32:08] RECOVERY - puppet last run on kafka1014 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures [21:32:14] bblack: damn [21:32:16] rolling back [21:32:18] (03Merged) 10jenkins-bot: group1 wikis to 1.30.0-wmf.2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/357727 (owner: 1020after4) [21:32:31] (03CR) 10jenkins-bot: group1 wikis to 1.30.0-wmf.2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/357727 (owner: 1020after4) [21:32:49] !log twentyafterfour@tin rebuilt wikiversions.php and synchronized wikiversions files: group1 wikis to 1.30.0-wmf.2 [21:32:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:33:42] hmmm the "too many down" craziness cleared itself before it could alert [21:33:45] (03PS9) 10Faidon Liambotis: Puppet compiler for Tim's redirects.dat DSL [puppet] - 10https://gerrit.wikimedia.org/r/138292 (owner: 10Ori.livneh) [21:33:45] beats me heh [21:35:22] 10Operations, 10Labs, 10netops: Consider renumbering Labs to separate address spaces - https://phabricator.wikimedia.org/T122406#3330525 (10chasemp) [21:35:23] that's weird [21:39:44] (03PS1) 10BBlack: cache_maps: remove other misc refs [puppet] - 10https://gerrit.wikimedia.org/r/357729 [21:40:10] (03PS2) 10BBlack: cache_maps: remove other misc refs [puppet] - 10https://gerrit.wikimedia.org/r/357729 (https://phabricator.wikimedia.org/T164608) [21:41:34] (03CR) 10BBlack: [C: 032] cache_maps: remove other misc refs [puppet] - 10https://gerrit.wikimedia.org/r/357729 (https://phabricator.wikimedia.org/T164608) (owner: 10BBlack) [21:47:03] so it doesn't look good for wmf.4 [21:47:20] twentyafterfour: I guess you'll be writing an outage report, no? [21:47:37] paravoid: was there an outage? 
[21:48:03] sorry, I assumed there was because of the rollback + pybal stuff that bblack mentioned [21:48:05] a feature was momentarily broken [21:48:12] haven't been paying close attention [21:48:28] the pybal stuff did sound like an outage but I didn't actually see it and bblack said it resolved itself before alerting so I don't know [21:49:25] I still don't understand how "a feature was broken" and "catchable fatal error" passed all of our CI/canaries and made it into prod, but I'll leave you guys to handle that part [21:49:38] I don't see any gaps in logs to indicate a total outage and I only deployed to group1 so it would be even more strange if everything went down from that [21:49:51] yeah, it might have been related to the kafka stuff, dunno [21:49:57] * twentyafterfour digs through more metrics and logs [21:49:58] bblack would know [21:50:24] (03CR) 10Thcipriani: [C: 031] scap: fix rubocop warnings [puppet] - 10https://gerrit.wikimedia.org/r/357720 (owner: 10Faidon Liambotis) [21:50:43] oh there was a big spike in dbreplication errors around the same time I think [21:50:45] hmmm [21:53:12] I'm seeing a lot of search failures [21:53:17] Elastica\Exception\PartialShardFailureException [21:53:27] those were happening before and after deployments though [21:53:39] no change in frequency that I can see but it still might be worrying? [21:54:00] "Reason":"Unknown GeoDistance ordinal [21:54:32] I was discussing this with ebernhardson [21:54:33] the pybal error was brief, and I don't think it was kafka-related (that was just a puppetization issue, not a runtime issue) [21:54:40] still not seeing evidence of any outage [21:54:45] it only showed for 1/3 in icinga web UI, then vanished [21:54:56] I can try to find it in icinga logs though [21:55:06] (or pybal logs I guess) [21:55:29] hmm, and it did seem to coincide with the deployment but that was just group1 so it shouldn't have had such drastic consequences? 
[21:55:54] ebernhardson, I guess we can just livehack this ^ while the migration is going on? [21:56:17] https://grafana.wikimedia.org/dashboard/db/varnish-aggregate-client-status-codes?orgId=1&var-site=All&var-cache_type=text&var-status_type=5&from=now-3h&to=now [21:56:29] ^ text had two small 5xx spikes around 20:41 and 21:13 [21:56:29] (03PS10) 10Faidon Liambotis: Puppet compiler for Tim's redirects.dat DSL [puppet] - 10https://gerrit.wikimedia.org/r/138292 (owner: 10Ori.livneh) [21:56:31] (03PS1) 10Faidon Liambotis: mediawiki: use compile_redirects as a function [puppet] - 10https://gerrit.wikimedia.org/r/357733 [21:56:34] not very significant, though [21:57:10] MaxSem: ideally add a config variable and set it :P [21:57:21] MaxSem: but you could live hack i suppose [21:57:23] (03CR) 10jerkins-bot: [V: 04-1] Puppet compiler for Tim's redirects.dat DSL [puppet] - 10https://gerrit.wikimedia.org/r/138292 (owner: 10Ori.livneh) [21:57:47] bblack: thanks for checking [21:58:49] the icinga log entries are: [21:58:54] [1496865871] SERVICE ALERT: lvs1003;PyBal backends health check;CRITICAL;SOFT;1;PYBAL CRITICAL - search-https_9243 - Could not depool server elastic1024.eqiad.wmnet because of too many down!: prometheus_80 - Could not depool server prometheus1003.eqiad.wmnet because of too many down!: wdqs_80 - Could not depool server wdqs1001.eqiad.wmnet because of too many down!: thumbor_8800 - Could not dep [21:59:00] ool server thumbor1002.eqiad.wmnet because of too many down! 
[21:59:02] [1496865941] SERVICE ALERT: lvs1003;PyBal backends health check;OK;SOFT;2;PYBAL OK - All pools are healthy [21:59:06] [1496872436] SERVICE ALERT: lvs1003;PyBal backends health check;CRITICAL;SOFT;1;PYBAL CRITICAL - api-https_443 - Could not depool server mw1200.eqiad.wmnet because of too many down!: swift-https_443 - Could not depool server ms-fe1005.eqiad.wmnet because of too many down!: appservers-https_443 - Could not depool server mw1216.eqiad.wmnet because of too many down!: rendering-https_443 - Could n [21:59:11] ot depool server mw1295.eqiad.wmnet because of too many down! [21:59:12] (03CR) 10Hashar: "modules/nodepool/templates/nodepool.yaml.erb has:" [puppet] - 10https://gerrit.wikimedia.org/r/357659 (https://phabricator.wikimedia.org/T165211) (owner: 10Andrew Bogott) [21:59:14] [1496872506] SERVICE ALERT: lvs1003;PyBal backends health check;OK;SOFT;2;PYBAL OK - All pools are healthy [22:00:14] (03PS11) 10Faidon Liambotis: mediawiki: puppet compiler for Tim's redirects DSL [puppet] - 10https://gerrit.wikimedia.org/r/138292 (owner: 10Ori.livneh) [22:00:19] could pybal be mistaken? I would think that would show up in other graphs if that many hosts were down [22:00:37] i think so, it says it wanted to depool elastic1024 but i just checked and the whole cluster is healthy [22:01:18] there should be no reason for it to have depooled so many elastic servers that it doesn't want to depool any more [22:01:34] those 4x epoch times above are: 20:04:31 20:05:41 21:53:56 21:55:06 [22:02:04] which does not coincide with the deployment time (close but not exactly) [22:02:22] (03PS2) 10Faidon Liambotis: mediawiki: use compile_redirects as a function [puppet] - 10https://gerrit.wikimedia.org/r/357733 [22:02:28] something must be off with my conversion [22:02:41] because I noted seeing the alert in the UI at 21:31 [22:02:42] 20:31 [22:02:48] er yeah 21:31 [22:03:06] which DOES coincide with the deployment [22:03:11] or icinga's timestamps are awful? 
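The epoch-to-clock conversion being questioned here can be double-checked with a short script. This is a sketch that assumes the bracketed numbers in the icinga log entries are Unix timestamps in seconds and that the wall-clock times in this channel log are UTC; the helper name `epoch_to_utc` is made up for illustration.

```python
from datetime import datetime, timezone


def epoch_to_utc(ts):
    """Format a Unix timestamp (seconds) as HH:MM:SS in UTC."""
    return datetime.fromtimestamp(ts, tz=timezone.utc).strftime("%H:%M:%S")


# The four epoch values quoted from the icinga log entries above.
ICINGA_EPOCHS = [1496865871, 1496865941, 1496872436, 1496872506]

for ts in ICINGA_EPOCHS:
    print(ts, epoch_to_utc(ts))
# 1496865871 20:04:31
# 1496865941 20:05:41
# 1496872436 21:53:56
# 1496872506 21:55:06
```

Under those assumptions the arithmetic reproduces the four times given at 22:01:34, so the conversion itself looks consistent; that does not explain why the alert was seen in the web UI at 21:31.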
[22:03:27] I could buy that [22:03:41] I have no idea, call it a fluke coincidence maybe [22:03:55] we get "could not depool" alerts, especially brief ones, all the time really [22:04:02] just odd to see so many services in one of them [22:05:53] yeah that's what I thought was strange as well [22:07:06] !log lvs1005 - restarting pybal to remove old maps table entries [22:07:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:09:16] !log lvs2005 - restarting pybal to remove old maps table entries [22:09:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:11:01] !log lvs3004 - restarting pybal to remove old maps table entries [22:11:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:12:03] !log lvs4004 - restarting pybal to remove old maps table entries [22:12:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:13:01] !log lvs1002 - restarting pybal to remove old maps table entries [22:13:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:13:48] !log lvs2002 - restarting pybal to remove old maps table entries [22:13:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:14:44] !log lvs3002 - restarting pybal to remove old maps table entries [22:14:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:15:30] !log lvs4002 - restarting pybal to remove old maps table entries [22:15:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:18:51] !log puppet node clean+deactivate for cp3003 [22:18:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:24:00] !log reimaging ex-cache_maps hosts (fresh role::spare::system installs) [22:24:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:24:32] PROBLEM - carbon-cache too many creates on graphite1001 is CRITICAL: CRITICAL: 1.69% of 
data above the critical threshold [1000.0] [22:24:45] 10Operations, 10Traffic, 10Patch-For-Review: Merge cache_maps into cache_upload functionally - https://phabricator.wikimedia.org/T164608#3239715 (10ops-monitoring-bot) Script wmf_auto_reimage was launched by bblack on neodymium.eqiad.wmnet for hosts: ``` ['cp3004.esams.wmnet', 'cp3005.esams.wmnet', 'cp3006.e... [22:27:32] PROBLEM - salt-minion processes on puppetmaster1001 is CRITICAL: PROCS CRITICAL: 8 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [22:32:39] (03PS1) 10BBlack: site.pp: commentary for ex-cache_maps hosts [puppet] - 10https://gerrit.wikimedia.org/r/357734 (https://phabricator.wikimedia.org/T164608) [22:35:16] (03CR) 10BBlack: [C: 032] site.pp: commentary for ex-cache_maps hosts [puppet] - 10https://gerrit.wikimedia.org/r/357734 (https://phabricator.wikimedia.org/T164608) (owner: 10BBlack) [22:35:48] 10Operations, 10Ops-Access-Requests, 10Patch-For-Review: Request access to analytics-privatedata-users for GoranSMilovanovic - https://phabricator.wikimedia.org/T167116#3330674 (10Dzahn) ``` [stat1002:~] $ id goransm uid=16664(goransm) gid=500(wikidev) groups=500(wikidev),731(analytics-privatedata-users),784... [22:36:19] 10Operations, 10Ops-Access-Requests, 10Patch-For-Review: Request access to analytics-privatedata-users for GoranSMilovanovic - https://phabricator.wikimedia.org/T167116#3330675 (10Dzahn) 05Open>03Resolved [22:39:20] (03CR) 10Dzahn: [C: 031] check_ipmi_temp: load ipmi_devintf [puppet] - 10https://gerrit.wikimedia.org/r/357617 (https://phabricator.wikimedia.org/T125205) (owner: 10Ema) [22:46:32] RECOVERY - salt-minion processes on puppetmaster1001 is OK: PROCS OK: 4 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [22:52:42] twentyafterfour: are those two log-error bugs (wikidata and revision slider) at a level of blocking the train too? or just that popups one? 
[23:00:04] addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Dear anthropoid, the time has come. Please deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170607T2300). [23:00:04] Jdlrobson: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [23:00:34] here! [23:03:47] 10Operations, 10Traffic, 10Patch-For-Review: Merge cache_maps into cache_upload functionally - https://phabricator.wikimedia.org/T164608#3330724 (10ops-monitoring-bot) Script wmf_auto_reimage was launched by bblack on neodymium.eqiad.wmnet for hosts: ``` ['cp3004.esams.wmnet', 'cp3005.esams.wmnet', 'cp4011.u... [23:06:37] PROBLEM - salt-minion processes on puppetmaster1001 is CRITICAL: PROCS CRITICAL: 9 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [23:07:14] twentyafterfour: hey you there? [23:07:14] i need more information about this popups bug [23:07:25] yeah... [23:07:30] and no tabbycat [23:07:38] greg-g: where is it deployed? [23:07:42] wmf4 i mean? [23:07:44] i can run some checks [23:07:50] but im not seeing anything broken [23:07:50] https://tools.wmflabs.org/versions/ [23:07:55] group0 [23:08:37] https://www.mediawiki.org/wiki/Special:Version is telling me 1.30.0-wmf.2 [23:09:45] page previews is working fine on https://test.wikipedia.org/wiki/Main_Page [23:10:14] so is navigation popups [23:11:07] greg-g: so i dont think this blocks deployment [23:11:15] (also does this mean my swat deploy is cancelled?) [23:11:42] only if no one who can swat is around [23:12:07] okay ill bug RoanKattouw and MaxSem and twentyafterfour :) [23:12:20] greg-g: i've removed that as a deployment blocker [23:12:25] thanks [23:12:36] In meeting [23:14:03] (03PS2) 10Dzahn: gerrit: switch to base::service_unit, import systemd unit from package [puppet] - 10https://gerrit.wikimedia.org/r/356516 [23:15:05] hmmm. 
[23:15:06] Cannot access the database: Access denied for user 'wikiuser'@'10.64.%' to database '0' (10.64.16.191)) [23:15:12] ^ [23:15:15] Where are you getting that? [23:15:18] (03CR) 10jerkins-bot: [V: 04-1] gerrit: switch to base::service_unit, import systemd unit from package [puppet] - 10https://gerrit.wikimedia.org/r/356516 (owner: 10Dzahn) [23:15:19] when trying to log in on test.wp.org [23:15:49] RoanKattouw: able to do a swat? I've got a couple of patches I need out [23:16:28] (03CR) 10Dzahn: "(per IRC talk with Moritz) amended and reused this, so now it doesn't bother with the existing sysvinit script anymore and also switches t" [puppet] - 10https://gerrit.wikimedia.org/r/356516 (owner: 10Dzahn) [23:18:21] jdlrobson: The second one looks like a typo? [23:18:34] (03PS3) 10Dzahn: gerrit: switch to base::service_unit, import systemd unit from package [puppet] - 10https://gerrit.wikimedia.org/r/356516 [23:19:25] (03PS4) 10Dzahn: gerrit: switch to base::service_unit, import systemd unit from package [puppet] - 10https://gerrit.wikimedia.org/r/356516 [23:20:35] (03PS5) 10Dzahn: gerrit: switch to base::service_unit and systemd [puppet] - 10https://gerrit.wikimedia.org/r/356516 [23:22:09] jdlrobson: I'm doing the first one (session ID should not change) now, but for the second one I'll need you to respond with a Gerrit link that isn't from 2012 ;) [23:22:18] argg not again [23:22:21] bad typo [23:22:41] https://gerrit.wikimedia.org/r/357666 < RoanKattouw [23:22:47] (03PS2) 10Jdlrobson: Relaunch related pages A/B test to 98% of users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/357666 (https://phabricator.wikimedia.org/T167310) [23:22:55] Thanks [23:23:03] I'll have that out shortly, can you fix the wiki page in the meantime? 
[23:23:16] (03CR) 10Catrope: [C: 032] Relaunch related pages A/B test to 98% of users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/357666 (https://phabricator.wikimedia.org/T167310) (owner: 10Jdlrobson) [23:23:28] RoanKattouw: already on it :) [23:23:32] RECOVERY - carbon-cache too many creates on graphite1001 is OK: OK: Less than 1.00% above the threshold [500.0] [23:24:39] (03Merged) 10jenkins-bot: Relaunch related pages A/B test to 98% of users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/357666 (https://phabricator.wikimedia.org/T167310) (owner: 10Jdlrobson) [23:25:44] (03CR) 10jenkins-bot: Relaunch related pages A/B test to 98% of users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/357666 (https://phabricator.wikimedia.org/T167310) (owner: 10Jdlrobson) [23:26:30] !log ppchelko@tin Started deploy [trending-edits/deploy@e0a8716]: Include reverts from bots to get rid of false positives [23:26:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:27:43] jdlrobson: sorry... [23:28:02] !log catrope@tin Synchronized wmf-config/InitialiseSettings.php: Relaunch related pages A/B test to 98% of users on enwiki (T167310) (duration: 00m 44s) [23:28:06] jdlrobson: Deploying the 98% patch now, foregoing mwdebug because it's a probabilistic one anyway [23:28:10] I don't know more, will need to ask tabbycat I guess? [23:28:10] Oh there, it's done [23:28:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:28:12] T167310: Re-launch related pages a/b test on mobile enwiki - https://phabricator.wikimedia.org/T167310 [23:28:23] twentyafterfour: sounds like it. Are they subscribed to the bug? [23:28:32] twentyafterfour: but as far as im concerned you dont need to block deployment on it [23:28:38] i can check it myself [23:28:48] (03CR) 10Dzahn: "should we keep using gerrit.sh for execStart/Stop or should we use "/bin/java .." and "/bin/kill .." instead? 
see inline comments" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/356516 (owner: 10Dzahn) [23:29:40] jdlrobson: I was blocking deployment on the other errors mostly, plus the anomaly that brandon noticed [23:30:04] it was just a bunch of things all at the same time which inclined me to roll back [23:30:09] !log catrope@tin Synchronized php-1.30.0-wmf.4/extensions/RelatedArticles/resources/ext.relatedArticles.readMore.eventLogging/index.js: T167236 (duration: 00m 43s) [23:30:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:30:19] T167236: userSessionToken in RelatedArticles schema does not seem to survive beyond one pageview - https://phabricator.wikimedia.org/T167236 [23:30:25] twentyafterfour: the RevisionSlider error? [23:30:54] thanks RoanKattouw looking good from my side [23:31:11] jdlrobson: And there's the other one too [23:31:24] jdlrobson: yeah mostly that one but T167360 looks like it may be a problem as well [23:31:25] T167360: "Warning: in_array() expects parameter 2 to be an array or collection" from Wikibase MwTimeIsoFormatter - https://phabricator.wikimedia.org/T167360 [23:32:53] i dont know wikibase sadly :/ [23:32:57] yeah [23:33:31] !log ppchelko@tin Finished deploy [trending-edits/deploy@e0a8716]: Include reverts from bots to get rid of false positives (duration: 07m 00s) [23:33:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:42:43] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 70.00% above the threshold [25.0] [23:51:15] Twentyafterfour I can look at the wikibase issue in a few minutes