[00:00:05] now: I've got to run, I'm already running late for my thing! [00:00:14] Thanks thcipriani|afk [00:00:20] yw :) [00:21:11] !log gerrit: rolled back to 2.13.4-13-gc0c5cc4742 from 2.13.8. T152640 rearing its ugly head again (login issues) [00:21:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:21:21] T152640: Cannot log into Gerrit as of recent upgrade - https://phabricator.wikimedia.org/T152640 [00:27:10] RECOVERY - puppet last run on labsdb1003 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [00:34:40] RECOVERY - Check Varnish expiry mailbox lag on cp1072 is OK: OK: expiry mailbox lag is 1890 [00:37:41] RoanKattouw, fixed it. It wasn't us, though: https://gerrit.wikimedia.org/r/#/c/357523/ [00:38:17] (03PS1) 10Chad: gerrit (2.13.8+git1-wmf.5) jessie-wikimedia; urgency=medium [debs/gerrit] - 10https://gerrit.wikimedia.org/r/357524 [00:39:43] (03PS2) 10Chad: gerrit (2.13.8+git1-wmf.5) jessie-wikimedia; urgency=medium [debs/gerrit] - 10https://gerrit.wikimedia.org/r/357524 (https://phabricator.wikimedia.org/T158946) [00:40:19] (03CR) 10Paladox: [C: 031] gerrit (2.13.8+git1-wmf.5) jessie-wikimedia; urgency=medium [debs/gerrit] - 10https://gerrit.wikimedia.org/r/357524 (https://phabricator.wikimedia.org/T158946) (owner: 10Chad) [00:41:06] (03CR) 10Dzahn: [C: 032] gerrit (2.13.8+git1-wmf.5) jessie-wikimedia; urgency=medium [debs/gerrit] - 10https://gerrit.wikimedia.org/r/357524 (https://phabricator.wikimedia.org/T158946) (owner: 10Chad) [01:00:20] 10Operations, 10Prometheus-metrics-monitoring: prometheus-node-exporter - invalid group: ‘prometheus:prometheus' - https://phabricator.wikimedia.org/T167245#3321908 (10Dzahn) [01:31:20] PROBLEM - puppet last run on analytics1055 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [01:59:20] RECOVERY - puppet last run on analytics1055 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures [02:27:51] 10Operations, 10Continuous-Integration-Infrastructure: CI for operations/puppet is taking too long - https://phabricator.wikimedia.org/T166888#3321960 (10greg) >>! In T166888#3321592, @faidon wrote: > So it doesn't really like sound like the primary use case is jobs like the operations/puppet linting, at least... [02:29:51] RECOVERY - nova instance creation test on labnet1001 is OK: PROCS OK: 1 process with command name python, args nova-fullstack [02:30:54] !log l10nupdate@tin scap sync-l10n completed (1.30.0-wmf.2) (duration: 07m 57s) [02:31:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:32:51] PROBLEM - nova instance creation test on labnet1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name python, args nova-fullstack [02:44:28] (03PS1) 10Aude: Enable sitelinks on Wikidata for Wiktionary pages outside main namespace (for beta) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/357532 [02:45:38] (03PS2) 10Aude: Enable sitelinks on Wikidata for Wiktionary pages outside main namespace (for beta) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/357532 (https://phabricator.wikimedia.org/T158323) [02:49:32] (03PS1) 10Aude: Enable Wikibase Client on beta wiktionary sites [mediawiki-config] - 10https://gerrit.wikimedia.org/r/357533 (https://phabricator.wikimedia.org/T158323) [03:04:46] !log l10nupdate@tin scap sync-l10n completed (1.30.0-wmf.4) (duration: 14m 29s) [03:04:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:06:40] PROBLEM - Apache HTTP on mw1206 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.076 second response time [03:07:40] RECOVERY - Apache HTTP on mw1206 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 613 bytes in 0.282 second response 
time [03:09:20] PROBLEM - HHVM rendering on mw1198 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.073 second response time [03:10:20] RECOVERY - HHVM rendering on mw1198 is OK: HTTP OK: HTTP/1.1 200 OK - 74476 bytes in 2.565 second response time [03:11:40] !log l10nupdate@tin ResourceLoader cache refresh completed at Wed Jun 7 03:11:40 UTC 2017 (duration 6m 54s) [03:11:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:21:30] PROBLEM - HHVM rendering on mw1197 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.073 second response time [03:22:30] RECOVERY - HHVM rendering on mw1197 is OK: HTTP OK: HTTP/1.1 200 OK - 74415 bytes in 1.189 second response time [04:25:11] PROBLEM - HHVM rendering on mw1204 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.076 second response time [04:26:11] RECOVERY - HHVM rendering on mw1204 is OK: HTTP OK: HTTP/1.1 200 OK - 74384 bytes in 0.605 second response time [04:30:11] RECOVERY - MariaDB Slave Lag: s1 on db1047 is OK: OK slave_sql_lag Replication lag: 0.42 seconds [05:08:40] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [05:08:50] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [05:11:40] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [05:12:10] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [05:26:50] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [05:28:10] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [05:28:47] <_joe_> uhm [05:29:33] it seems that was triggered by scap for 
wikidata deploy? [05:29:40] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [05:29:51] RECOVERY - nova instance creation test on labnet1001 is OK: PROCS OK: 1 process with command name python, args nova-fullstack [05:30:08] (03PS1) 10Marostegui: db-eqiad.php: Repool db1053, depool db1056 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/357550 (https://phabricator.wikimedia.org/T166206) [05:30:40] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [05:31:08] <_joe_> yeah again [05:31:17] <_joe_> mutante: I think tha't a red herring [05:31:24] ok [05:31:35] <_joe_> also, I don't see the scap sync [05:31:50] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [05:31:56] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Repool db1053, depool db1056 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/357550 (https://phabricator.wikimedia.org/T166206) (owner: 10Marostegui) [05:31:59] <_joe_> mutante: I think it's varnish, but since it's been going on for a fat 4 hours now, I'll first have breakfast [05:32:06] yea, that was already completed over 2 hours ago. 
20:08 < logmsgbot> !log l10nupdate@tin scap sync-l10n completed (1.30.0-wmf.4) (duration: 14m 29s) [05:32:32] <_joe_> it's 3 hours sorry [05:32:43] <_joe_> since 2.40 UTC we have those peaks [05:32:51] PROBLEM - nova instance creation test on labnet1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name python, args nova-fullstack [05:33:10] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [05:33:32] (03Merged) 10jenkins-bot: db-eqiad.php: Repool db1053, depool db1056 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/357550 (https://phabricator.wikimedia.org/T166206) (owner: 10Marostegui) [05:33:45] (03CR) 10jenkins-bot: db-eqiad.php: Repool db1053, depool db1056 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/357550 (https://phabricator.wikimedia.org/T166206) (owner: 10Marostegui) [05:34:40] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [1000.0] [05:35:06] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1053, depool db1056 - T166206 (duration: 01m 03s) [05:35:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:35:17] T166206: Convert unique keys into primary keys for some wiki tables on s4 - https://phabricator.wikimedia.org/T166206 [05:35:40] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] [05:36:40] PROBLEM - Codfw HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] [05:39:59] <_joe_> it seems the cause is not varnish [05:40:20] <_joe_> marostegui: can you hold on with you changes for now? 
[05:41:06] yes [05:41:13] I am actually checking it was not my change [05:41:30] because it went straight after my deploy [05:41:42] Ah no, it started before [05:41:59] (my irssi was a bit crazy) [05:42:02] https://grafana.wikimedia.org/dashboard/db/varnish-http-errors-datacenters?orgId=1 says otherwise and that it's all green [05:42:04] BUt yes, I am not doing anything else [05:42:12] but i was about to say that earlier too and then it wasnt again [05:42:35] <_joe_> yeah there is a recurring issue, clearly [05:44:41] RECOVERY - Codfw HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [06:03:40] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [06:04:10] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [06:04:50] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [06:05:40] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [06:09:20] PROBLEM - mailman I/O stats on fermium is CRITICAL: CRITICAL - I/O stats: Transfers/Sec=345.40 Read Requests/Sec=3420.20 Write Requests/Sec=11.70 KBytes Read/Sec=26165.60 KBytes_Written/Sec=3558.80 [06:16:20] RECOVERY - mailman I/O stats on fermium is OK: OK - I/O stats: Transfers/Sec=1.70 Read Requests/Sec=3.90 Write Requests/Sec=76.80 KBytes Read/Sec=16.40 KBytes_Written/Sec=420.40 [06:28:31] I checked https://logstash.wikimedia.org/app/kibana#/dashboard/Varnish-Webrequest-50X and the last peak of 503s seemed to be related to ints for some eqiad cp hosts, like cp1055 [06:29:02] maybe this is the recurring varnish backend issue [06:30:00] <_joe_> elukey: no. 
[06:30:06] mmm now that I am checking I am seeing more, nevermind [06:30:12] <_joe_> coordinate with others [06:30:14] <_joe_> ;) [06:30:20] <_joe_> we figured it out in the meantime [06:30:40] ah didn't check sec, thanks [06:44:55] (03PS1) 10Marostegui: db-eqiad.php: Repool db1056 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/357552 [06:46:25] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Repool db1056 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/357552 (owner: 10Marostegui) [06:47:57] (03Merged) 10jenkins-bot: db-eqiad.php: Repool db1056 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/357552 (owner: 10Marostegui) [06:48:06] (03CR) 10jenkins-bot: db-eqiad.php: Repool db1056 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/357552 (owner: 10Marostegui) [06:49:12] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1056 - T166206 (duration: 00m 44s) [06:49:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:49:23] T166206: Convert unique keys into primary keys for some wiki tables on s4 - https://phabricator.wikimedia.org/T166206 [06:58:10] (03PS2) 10Elukey: Delete unused role/common/analytics/hadoop configs [puppet] - 10https://gerrit.wikimedia.org/r/357418 [07:00:07] (03CR) 10Elukey: [C: 032] Delete unused role/common/analytics/hadoop configs [puppet] - 10https://gerrit.wikimedia.org/r/357418 (owner: 10Elukey) [07:02:01] mutante: if you are around, https://gerrit.wikimedia.org/r/#/c/356236 was not merged on puppet master [07:02:31] seems easy enough, it only touches tox.ini [07:04:22] +1 by hashar, +2 by Daniel seems good enough, merging [07:05:37] mutante: merged! 
[07:07:51] just tried tox -e pep8 ./modules/varnish/files/varnishapi.py, works fine on my laptop [07:08:54] (flake8 dep is downloaded correctly, everything passes) [07:11:07] 10Operations, 10HHVM: Upload hhvm to stretch apt repo in apt.wikimedia.org - https://phabricator.wikimedia.org/T167225#3322240 (10MoritzMuehlenhoff) 05Open>03declined We won't provide HHVM packages for stretch before we start the stretch migration of the production app servers since building HHVM and the e... [07:20:59] (03PS1) 10Ema: VCL: basic support to return HTTP 420, apply it to UA:wikiScrape/0.0.0 [puppet] - 10https://gerrit.wikimedia.org/r/357557 [07:22:16] !log Deploy alter table on db1047 enwiki.revision - T162807 [07:22:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:22:26] T162807: Run pt-table-checksum on s1 (enwiki) - https://phabricator.wikimedia.org/T162807 [07:22:40] marostegui: do1047 is in your nightmares nowadays :D [07:22:45] *db [07:22:49] I know…. [07:22:59] thanks for working on it! [07:23:04] Can't wait to decommission that host! [07:23:15] let me know if I can help! Maybe there is a chance to learn something [07:23:32] haha nah, it is just a long running alter table [07:24:58] I promise that I will not stop mysql this time before checking alter tables :D [07:25:33] marostegui: about db104[67], do you think that we'll be able to order the hw next quarter ? [07:25:50] elukey: I would hope so, yes [07:26:05] super, let me know we need to do anything [07:26:18] elukey: Will do, not sure under which budget that goes [07:26:49] good question [07:32:02] elukey: sorry for that and yes, you did right. thank you! 
/me out again [07:32:43] 10Operations, 10netops: Faulty link between cr2-codfw and cr1-eqdfw - https://phabricator.wikimedia.org/T167261#3322265 (10ayounsi) [07:37:21] (03PS1) 10Muehlenhoff: Use new repository layout for stretch onwards [puppet] - 10https://gerrit.wikimedia.org/r/357559 (https://phabricator.wikimedia.org/T158583) [07:53:28] (03CR) 10Muehlenhoff: "Patch for apt source via https://gerrit.wikimedia.org/r/#/c/357559/" [puppet] - 10https://gerrit.wikimedia.org/r/357422 (https://phabricator.wikimedia.org/T162609) (owner: 10Filippo Giunchedi) [08:02:19] !log Run redact_sanitarium on db1095 for dewiki - T153743 [08:02:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:02:27] T153743: Add and sanitize s2, s4, s5, s6 and s7 to sanitarium2 and new labsdb hosts - https://phabricator.wikimedia.org/T153743 [08:02:50] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 212, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-3/2/3: down - Core: cr2-codfw:xe-5/0/1 (Zayo, OGYX/120003//ZYO) 36ms {#2909} [10Gbps wave]BR [08:06:29] (03PS3) 10Filippo Giunchedi: aptrepo: add hp-mcp-stretch and thirdparty/hwraid [puppet] - 10https://gerrit.wikimedia.org/r/357422 (https://phabricator.wikimedia.org/T162609) [08:10:13] (03PS1) 10Muehlenhoff: Configure fixed lock manager ports for labstore NFS servers [puppet] - 10https://gerrit.wikimedia.org/r/357562 (https://phabricator.wikimedia.org/T165136) [08:12:08] (03CR) 10Gehel: Add Shiny Server module and Discovery Dashboards role/profile (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/353571 (https://phabricator.wikimedia.org/T161354) (owner: 10Bearloga) [08:12:19] (03CR) 10Filippo Giunchedi: aptrepo: add hp-mcp-stretch and thirdparty/hwraid (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/357422 (https://phabricator.wikimedia.org/T162609) (owner: 10Filippo Giunchedi) [08:16:42] (03CR) 10Gehel: kibana: allow any arbitrary setting (031 
comment) [puppet] - 10https://gerrit.wikimedia.org/r/356900 (owner: 10Hashar) [08:22:44] 10Operations, 10ops-eqiad: Degraded RAID on ms-be1016 - https://phabricator.wikimedia.org/T167264#3322381 (10ops-monitoring-bot) [08:23:22] (03CR) 10Hashar: "I felt that adding settings each time we need one would be cumbersome. The only reason I made this totally open was to add "elasticsearch" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/356900 (owner: 10Hashar) [08:27:30] (03PS1) 10Marostegui: db-eqiad.php: Depool db1056 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/357564 (https://phabricator.wikimedia.org/T166206) [08:28:53] 10Operations, 10ops-eqiad: Degraded RAID on ms-be1016 - https://phabricator.wikimedia.org/T167264#3322398 (10fgiunchedi) [08:29:05] 10Operations, 10ops-eqiad: Degraded RAID on ms-be1016 - https://phabricator.wikimedia.org/T167264#3322381 (10fgiunchedi) a:03Cmjohnson [08:32:21] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1056 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/357564 (https://phabricator.wikimedia.org/T166206) (owner: 10Marostegui) [08:33:36] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1056 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/357564 (https://phabricator.wikimedia.org/T166206) (owner: 10Marostegui) [08:33:49] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1056 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/357564 (https://phabricator.wikimedia.org/T166206) (owner: 10Marostegui) [08:34:35] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1056 - T166206 (duration: 00m 43s) [08:34:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:34:44] T166206: Convert unique keys into primary keys for some wiki tables on s4 - https://phabricator.wikimedia.org/T166206 [08:34:51] (03CR) 10Muehlenhoff: [C: 031] aptrepo: add hp-mcp-stretch and thirdparty/hwraid [puppet] - 10https://gerrit.wikimedia.org/r/357422 
(https://phabricator.wikimedia.org/T162609) (owner: 10Filippo Giunchedi) [08:37:01] 10Operations, 10Wikimedia-Logstash, 10Discovery-Search (Current work): upgrade kibana to v5.3.3 - https://phabricator.wikimedia.org/T167266#3322434 (10Gehel) [08:37:43] 10Operations, 10Prometheus-metrics-monitoring: prometheus-node-exporter - invalid group: ‘prometheus:prometheus' - https://phabricator.wikimedia.org/T167245#3321908 (10fgiunchedi) Indeed, the problem there I think is that `prometheus` user exists in labs but not the group, was node-exporter working otherwise? [08:38:13] 10Operations, 10Monitoring: internal IPMI error - https://phabricator.wikimedia.org/T167121#3322464 (10ema) [08:38:52] 10Operations, 10Monitoring: internal IPMI error - https://phabricator.wikimedia.org/T167121#3318229 (10ema) [08:40:20] PROBLEM - HP RAID on ms-be1016 is CRITICAL: CRITICAL: Slot 1: Failed: 1I:1:2 - OK: 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, 2I:4:1, 2I:4:2 - Controller: OK - Battery/Capacitor: OK [08:40:22] ACKNOWLEDGEMENT - HP RAID on ms-be1016 is CRITICAL: CRITICAL: Slot 1: Failed: 1I:1:2 - OK: 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, 2I:4:1, 2I:4:2 - Controller: OK - Battery/Capacitor: OK nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T167268 [08:40:25] 10Operations, 10ops-eqiad: Degraded RAID on ms-be1016 - https://phabricator.wikimedia.org/T167268#3322470 (10ops-monitoring-bot) [08:41:20] PROBLEM - puppet last run on ms-be1016 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Exec[mkfs-/dev/sdh1] [08:42:21] 10Operations, 10ops-eqiad, 10media-storage: Degraded RAID on ms-be1016 - https://phabricator.wikimedia.org/T167268#3322474 (10Volans) [08:42:32] godog: ^^^ also puppet broken, as expected ;) [08:42:40] you're breaking too many disks lately :-P [08:43:15] ask the users to upload/download less, fewer disks broken :P [08:43:42] :D [08:43:56] I need to check later why we got 2 tasks though [08:44:23] T167268 and T167264 seem the same, but have different output [08:44:23] T167268: Degraded RAID on ms-be1016 - https://phabricator.wikimedia.org/T167268 [08:44:23] T167264: Degraded RAID on ms-be1016 - https://phabricator.wikimedia.org/T167264 [08:44:24] my fault, the first is manual because I thought the disk was already failed on the controller but it wasn't [08:44:38] then I marked the disk failed manually on the controller too [08:46:29] 10Operations, 10ops-eqiad, 10media-storage: Degraded RAID on ms-be1016 - https://phabricator.wikimedia.org/T167268#3322489 (10fgiunchedi) [08:46:31] 10Operations, 10ops-eqiad: Degraded RAID on ms-be1016 - https://phabricator.wikimedia.org/T167264#3322487 (10fgiunchedi) [08:46:39] godog: got it :) feel free to close one [08:46:53] one less thing to check for me, mystery solved ;) [08:46:57] 10Operations, 10ops-eqiad: Degraded RAID on ms-be1016 - https://phabricator.wikimedia.org/T167264#3322381 (10fgiunchedi) ``` 09:43 I need to check later why we got 2 tasks though 09:44 my fault, the first is manual because I thought the disk was already failed on the controller but it wasn't... 
[08:47:35] yeah, we're still on the mistery of why hw can be so crap sometimes [08:50:19] !log Deploy alter table s4 - db1056 - T166206 [08:50:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:50:28] T166206: Convert unique keys into primary keys for some wiki tables on s4 - https://phabricator.wikimedia.org/T166206 [08:51:46] (03PS1) 10Gehel: Upgrade kibana to v5.3.3 [puppet] - 10https://gerrit.wikimedia.org/r/357565 (https://phabricator.wikimedia.org/T167266) [08:54:20] (03PS1) 10Marostegui: db-eqiad.php: Depool db1076 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/357566 (https://phabricator.wikimedia.org/T166205) [08:54:39] (03PS1) 10Elukey: Bump zookeeper client version to 3.4.5+dfsg-2+deb8u2 [puppet] - 10https://gerrit.wikimedia.org/r/357567 [08:54:55] (03Abandoned) 10Elukey: Bump zookeeper client version to 3.4.5+dfsg-2+deb8u2 [puppet] - 10https://gerrit.wikimedia.org/r/357567 (owner: 10Elukey) [08:55:05] not needed, pebkac [08:56:32] (03PS1) 10Filippo Giunchedi: swift: mask object reconstructor on >= jessie [puppet] - 10https://gerrit.wikimedia.org/r/357568 (https://phabricator.wikimedia.org/T162609) [08:56:39] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1076 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/357566 (https://phabricator.wikimedia.org/T166205) (owner: 10Marostegui) [08:57:36] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1076 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/357566 (https://phabricator.wikimedia.org/T166205) (owner: 10Marostegui) [08:57:45] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1076 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/357566 (https://phabricator.wikimedia.org/T166205) (owner: 10Marostegui) [08:58:34] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1076 - T166205 (duration: 00m 43s) [08:58:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:58:43] T166205: Convert unique keys into primary keys 
for some wiki tables on s2 - https://phabricator.wikimedia.org/T166205 [08:58:52] !log Deploy alter table on s2 - db1076 - T166205 [08:59:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:03:05] (03CR) 10Filippo Giunchedi: [C: 032] swift: mask object reconstructor on >= jessie [puppet] - 10https://gerrit.wikimedia.org/r/357568 (https://phabricator.wikimedia.org/T162609) (owner: 10Filippo Giunchedi) [09:08:37] 10Operations: Integrate jessie 8.8 point release - https://phabricator.wikimedia.org/T164703#3322545 (10ema) p:05Triage>03Normal [09:08:55] (03PS1) 10Filippo Giunchedi: install_server: move ms-be2* trusty hosts to stretch [puppet] - 10https://gerrit.wikimedia.org/r/357569 (https://phabricator.wikimedia.org/T162609) [09:11:44] (03CR) 10Muehlenhoff: [C: 031] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/357565 (https://phabricator.wikimedia.org/T167266) (owner: 10Gehel) [09:12:44] (03PS2) 10Gehel: Upgrade kibana to v5.3.3 [puppet] - 10https://gerrit.wikimedia.org/r/357565 (https://phabricator.wikimedia.org/T167266) [09:15:11] !log upgrade lvs4001-4004 to jessie 8.8 point release T164703 [09:15:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:15:20] T164703: Integrate jessie 8.8 point release - https://phabricator.wikimedia.org/T164703 [09:16:09] (03CR) 10Gehel: [C: 032] Upgrade kibana to v5.3.3 [puppet] - 10https://gerrit.wikimedia.org/r/357565 (https://phabricator.wikimedia.org/T167266) (owner: 10Gehel) [09:16:47] (03CR) 10Filippo Giunchedi: [C: 032] install_server: move ms-be2* trusty hosts to stretch [puppet] - 10https://gerrit.wikimedia.org/r/357569 (https://phabricator.wikimedia.org/T162609) (owner: 10Filippo Giunchedi) [09:16:47] (03PS2) 10Filippo Giunchedi: install_server: move ms-be2* trusty hosts to stretch [puppet] - 10https://gerrit.wikimedia.org/r/357569 (https://phabricator.wikimedia.org/T162609) [09:22:21] 10Operations, 10Wikimedia-Logstash, 10Discovery-Search (Current 
work), 10Patch-For-Review: upgrade kibana to v5.3.3 - https://phabricator.wikimedia.org/T167266#3322556 (10Gehel) kibana 5.3.3 is now uploaded to reprero [09:28:58] !log upgrading kibana to v5.3.3 on logstash cluster - T167266 [09:29:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:29:08] T167266: upgrade kibana to v5.3.3 - https://phabricator.wikimedia.org/T167266 [09:32:12] gehel: FYI there's cronspam from logstash, /bin/sh: 1: /usr/local/bin/logstash_delete_index: not found [09:32:27] damn, still there? [09:32:48] yeah logstash1002 sent it this morning too [09:32:57] only that machien though [09:34:01] strange, that's the old cron that has been removed last week. I can't see it with crontab -l ... [09:34:46] it'd be in /var/spool/cron/crontabs then [09:35:32] nope, not there either [09:38:18] 10Operations, 10Wikimedia-Logstash, 10Discovery-Search (Current work), 10Patch-For-Review: upgrade kibana to v5.3.3 - https://phabricator.wikimedia.org/T167266#3322593 (10Gehel) Kibana is now upgraded on the production logstash cluster and in deployment-logstash2 [09:41:38] odd, let me know if you find it! [09:42:01] godog: I'm grepping left and right, but no, I don't find it... 
[09:42:55] 10Operations, 10Wikimedia-Logstash, 10Discovery-Search (Current work), 10Patch-For-Review: upgrade kibana to v5.3.3 - https://phabricator.wikimedia.org/T167266#3322617 (10Gehel) Note: this upgrade is patching the issues : https://discuss.elastic.co/t/elastic-stack-5-4-1-and-5-3-3-security-updates/87952 [09:43:14] meh the only other thing I can think of is the olde "service cron restart" [09:44:06] 10Operations, 10Wikimedia-Logstash, 10Discovery-Search (Current work), 10Patch-For-Review: upgrade kibana to v5.3.3 - https://phabricator.wikimedia.org/T167266#3322434 (10MoritzMuehlenhoff) Which is CVE-2017-8440 (just to have the ID indexed when searching for it in Phab) [09:45:53] godog: I don't see the email though [09:46:20] ahhh got into spam [09:46:57] technically correct! [09:47:32] godog: https://giphy.com/gifs/the-simpsons-trash-garbage-5xaOcLCBzBw4QrtdDP2 [09:48:05] gehel: in which file was logstash_delete_index referenced? [09:48:07] haahh yes I was rummaging in spam [09:48:59] volans: the cron was setup by puppet, so most probably in /var/spool/crontabs [09:49:03] !log upgrade lvs[3001-3004] to jessie 8.8 point release T164703 [09:49:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:49:11] T164703: Integrate jessie 8.8 point release - https://phabricator.wikimedia.org/T164703 [09:49:13] I can see the log where puppet removed that cron [09:49:47] before or after the email? [09:49:55] well before [09:50:22] I think I'll try godog advice and check tomorrow ... [09:50:37] lunch, back later [09:55:41] RECOVERY - IPMI Temperature on ms-be2014 is OK: Sensor Type(s) Temperature Status: OK [09:56:43] meh, I just rebooted it for unrelated reasons [09:59:52] volans: good morning. 
I hope I haven't been too pedantic on the flake8_no_extension review ( https://gerrit.wikimedia.org/r/#/c/357197 ) [10:00:08] (03PS1) 10Filippo Giunchedi: install_server: s/ubuntu/debian/ for stretch ms-be2* machines [puppet] - 10https://gerrit.wikimedia.org/r/357576 [10:00:16] godog: interesting, the reboot fixed the IPMI temperature check? [10:01:06] hashar: no, have you seen the latest patchset ;) [10:01:09] ema: seems like it [10:01:20] (03PS2) 10Filippo Giunchedi: install_server: s/ubuntu/debian/ for stretch ms-be2* machines [puppet] - 10https://gerrit.wikimedia.org/r/357576 [10:01:35] volans: yeah gotta review it :-} [10:02:12] volans: the py.erb are all bad design patterns really ... then there are only 9 such files so it is not too much trouble [10:02:28] (03CR) 10Filippo Giunchedi: [V: 032 C: 032] install_server: s/ubuntu/debian/ for stretch ms-be2* machines [puppet] - 10https://gerrit.wikimedia.org/r/357576 (owner: 10Filippo Giunchedi) [10:02:29] should be much simpler now, then it could be modified later to be reused for different checks (rake, shellcheck, etc) [10:02:53] 10Operations, 10Monitoring: internal IPMI error - https://phabricator.wikimedia.org/T167121#3322672 (10ema) @fgiunchedi just rebooted ms-be2014 for unrelated reasons and the reboot alone fixed the issue. Perhaps IPMI can end up in some weird state and that gets fixed upon reboot? [10:02:55] hashar: yeah, that's why I'd like it to show them so that we can fix them ;) 1 or 2 are .erb without .py.erb also [10:08:12] 10Operations, 10Goal, 10Kubernetes, 10Patch-For-Review, and 3 others: Prepare and maintain base container images - https://phabricator.wikimedia.org/T162042#3322681 (10Joe) [10:10:20] PROBLEM - dhclient process on thumbor1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:10:21] PROBLEM - salt-minion processes on thumbor1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[10:12:11] RECOVERY - dhclient process on thumbor1001 is OK: PROCS OK: 0 processes with command name dhclient [10:12:11] RECOVERY - salt-minion processes on thumbor1001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [10:17:35] 10Operations, 10Goal, 10Kubernetes: Define a process to keep images up-to-date on similar standards as the rest of production - https://phabricator.wikimedia.org/T162043#3322706 (10Joe) [10:17:55] 10Operations, 10Kubernetes, 10Prod-Kubernetes (Experiment), 10User-Joe: Make security updates of docker images manageable - https://phabricator.wikimedia.org/T167269#3322625 (10Joe) [10:26:09] !log upgrade lvs1*/lvs2* to jessie 8.8 point release T164703 [10:26:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:26:18] T164703: Integrate jessie 8.8 point release - https://phabricator.wikimedia.org/T164703 [10:29:14] !log installing tiff regression security update on trusty [10:29:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:30:31] PROBLEM - puppet last run on lvs2006 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [10:33:30] RECOVERY - puppet last run on lvs2006 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [10:33:39] the lvs2006 puppetfail was me ^ [10:34:41] (03CR) 10Hashar: [C: 031] "Thank you for the follow up. On my machine the script takes roughly 5 seconds that will be added to every build of the job, but i think th" (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/357197 (https://phabricator.wikimedia.org/T144169) (owner: 10Volans) [10:35:05] volans: +1 from me. 
And maybe we can fix up the few shebangs that prevent libmagic detection [10:35:36] thanks for the review [10:37:12] volans: twist, if you pass --statistics to flake8, it shows a breakdown count of each error :} [10:37:30] most seem to be whitespace related so should be easy to fix up [10:47:42] 10Operations, 10netops: codfw row D switch upgrade - https://phabricator.wikimedia.org/T167274#3322805 (10ayounsi) [10:54:31] PROBLEM - puppet last run on sca1003 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[scap] [11:01:02] PROBLEM - DPKG on lvs1002 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [11:01:55] me again ^ [11:02:02] RECOVERY - DPKG on lvs1002 is OK: All packages OK [11:03:01] PROBLEM - puppet last run on ms-be2007 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[prometheus-node-exporter] [11:03:25] that's me instead ^ [11:03:41] PROBLEM - Check Varnish expiry mailbox lag on cp1072 is CRITICAL: CRITICAL: expiry mailbox lag is 2017994 [11:04:01] PROBLEM - DPKG on lvs1001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [11:04:31] 10Operations, 10DBA, 10Wikimedia-Site-requests: Rename user "Mlpearc" to "FlightTime" on Central Auth - https://phabricator.wikimedia.org/T166028#3322892 (10MarcoAurelio) [11:05:01] RECOVERY - DPKG on lvs1001 is OK: All packages OK [11:06:05] 10Operations, 10DBA, 10Wikimedia-Site-requests: Rename user "Mlpearc" to "FlightTime" on Central Auth - https://phabricator.wikimedia.org/T166028#3283276 (10MarcoAurelio) @jcrespo and @Marostegui for your opinion. 
[11:07:11] PROBLEM - DPKG on lvs1003 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [11:08:44] !log restarting cron on logstash cluster [11:08:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:10:11] RECOVERY - DPKG on lvs1003 is OK: All packages OK [11:17:20] (03PS1) 10Ayounsi: LibreNMS: enable 2FA [puppet] - 10https://gerrit.wikimedia.org/r/357585 (https://phabricator.wikimedia.org/T164911) [11:19:09] (03CR) 10Ayounsi: [C: 032] LibreNMS: enable 2FA [puppet] - 10https://gerrit.wikimedia.org/r/357585 (https://phabricator.wikimedia.org/T164911) (owner: 10Ayounsi) [11:22:41] RECOVERY - puppet last run on sca1003 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures [11:25:28] 10Operations, 10Monitoring: internal IPMI error - https://phabricator.wikimedia.org/T167121#3322958 (10ema) According to [[http://www.gnu.org/software/freeipmi/freeipmi-faq.html#Why-am-I-seeing-so-many-_0027internal-IPMI-error_0027-or-_0027driver-busy_0027-messages_003f | the Freeipmi FAQs ]], /dev/ipmi0 shoul... [11:26:49] 10Operations, 10netops, 10Patch-For-Review: LibreNMS improvements - https://phabricator.wikimedia.org/T164911#3322959 (10ayounsi) [11:30:01] RECOVERY - nova instance creation test on labnet1001 is OK: PROCS OK: 1 process with command name python, args nova-fullstack [11:30:11] RECOVERY - puppet last run on ms-be2007 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures [11:30:12] RECOVERY - IPMI Temperature on ocg1002 is OK: Sensor Type(s) Temperature Status: OK [11:31:22] 10Operations, 10Continuous-Integration-Infrastructure: CI for operations/puppet is taking too long - https://phabricator.wikimedia.org/T166888#3322963 (10faidon) Great, thanks :) I'm looking at the output of a Jenkins job and it looks like it takes about a minute to execute, so I guess we have two semi-related... 
[11:33:01] PROBLEM - nova instance creation test on labnet1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name python, args nova-fullstack [11:41:15] (03CR) 10Faidon Liambotis: [C: 04-1] "5 seconds on every job is quite a bit -- and on the CI instances with empty pagecaches and in VMs in a shared infrastructure without SSDs " [puppet] - 10https://gerrit.wikimedia.org/r/357197 (https://phabricator.wikimedia.org/T144169) (owner: 10Volans) [11:42:40] 10Operations, 10netops, 10Patch-For-Review: LibreNMS improvements - https://phabricator.wikimedia.org/T164911#3322995 (10ayounsi) [11:48:29] PROBLEM - puppet last run on cp3038 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:49:15] (03CR) 10Faidon Liambotis: [C: 04-1] "Minor syntax change inside. Moreover, d-i + labs_bootstrapvz + docker (+ package_builder) all need the same change as well. For them third" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/357559 (https://phabricator.wikimedia.org/T158583) (owner: 10Muehlenhoff) [11:54:30] (03CR) 10Faidon Liambotis: [C: 04-1] "This would work, but IMHO it's ugly because:" [puppet] - 10https://gerrit.wikimedia.org/r/357422 (https://phabricator.wikimedia.org/T162609) (owner: 10Filippo Giunchedi) [11:58:52] 10Operations, 10Office-IT, 10netops: Some BGP sessions to the SF Office down - https://phabricator.wikimedia.org/T167281#3323004 (10ayounsi) [12:00:04] aude: Respected human, time to deploy Wikidata (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170607T1200). Please do the needful. 
[12:03:38] 10Operations, 10TCB-Team, 10Two-Column-Edit-Conflict-Merge, 10Patch-For-Review, 10User-Addshore: Deploy TwoColConflict extension to beta - https://phabricator.wikimedia.org/T154927#3323146 (10Tobi_WMDE_SW) [12:04:14] 10Operations, 10LDAP-Access-Requests, 10TCB-Team, 10Wikidata, 10Release-Engineering-Team (Kanban): Add Andrew and Aleksey to ldap/wmde group - https://phabricator.wikimedia.org/T152088#3323163 (10Tobi_WMDE_SW) [12:04:25] * aude waves [12:04:29] 10Operations, 10Electron-PDFs, 10TCB-Team, 10Patch-For-Review, 10User-Addshore: Deploy ElectronPdfService Extension to beta cluster - https://phabricator.wikimedia.org/T150945#3323176 (10Tobi_WMDE_SW) [12:04:31] 10Operations, 10Electron-PDFs, 10TCB-Team, 10Patch-For-Review, 10User-Addshore: Deploy ElectronPdfService Extension to dewiki - https://phabricator.wikimedia.org/T150942#3323179 (10Tobi_WMDE_SW) [12:05:04] 10Operations, 10TCB-Team, 10Two-Column-Edit-Conflict-Merge, 10Patch-For-Review, 10User-Addshore: Deploy TwoColConflict extension to production - https://phabricator.wikimedia.org/T150184#3323192 (10Tobi_WMDE_SW) [12:06:41] 10Operations, 10Discovery, 10Wikidata, 10Wikidata-Query-Service, and 3 others: publish lag and response time for wdqs codfw to graphite - https://phabricator.wikimedia.org/T146207#3323239 (10Tobi_WMDE_SW) [12:11:39] (03CR) 10Aude: [C: 032] Enable sitelinks on Wikidata for Wiktionary pages outside main namespace (for beta) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/357532 (https://phabricator.wikimedia.org/T158323) (owner: 10Aude) [12:12:40] (03Merged) 10jenkins-bot: Enable sitelinks on Wikidata for Wiktionary pages outside main namespace (for beta) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/357532 (https://phabricator.wikimedia.org/T158323) (owner: 10Aude) [12:12:53] (03CR) 10jenkins-bot: Enable sitelinks on Wikidata for Wiktionary pages outside main namespace (for beta) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/357532 
(https://phabricator.wikimedia.org/T158323) (owner: 10Aude) [12:14:20] 10Operations, 10Patch-For-Review: Race condition in setting net.netfilter.nf_conntrack_tcp_timeout_time_wait - https://phabricator.wikimedia.org/T136094#3323294 (10MoritzMuehlenhoff) This has also independently been reported at Debian: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=864341 [12:15:29] PROBLEM - MariaDB Slave IO: x1 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:15:29] PROBLEM - MariaDB Slave SQL: m2 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:15:29] PROBLEM - MariaDB Slave IO: s3 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:15:29] PROBLEM - MariaDB Slave IO: s6 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:15:29] PROBLEM - MariaDB Slave SQL: s6 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:15:30] PROBLEM - MariaDB Slave IO: s5 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:15:30] PROBLEM - MariaDB Slave SQL: x1 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:15:31] PROBLEM - MariaDB Slave SQL: s3 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:15:31] PROBLEM - MariaDB Slave IO: s7 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:15:32] PROBLEM - MariaDB Slave IO: m2 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:15:32] PROBLEM - MariaDB Slave SQL: s1 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:15:33] PROBLEM - MariaDB Slave SQL: m3 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:15:33] PROBLEM - MariaDB Slave SQL: s5 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:15:34] PROBLEM - MariaDB Slave SQL: s7 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[12:16:46] testing my change on mwdebug1002 [12:17:21] /away [12:17:26] arggg [12:17:29] RECOVERY - MariaDB Slave SQL: m3 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes [12:17:30] RECOVERY - MariaDB Slave IO: s7 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [12:17:30] RECOVERY - MariaDB Slave SQL: s1 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes [12:17:30] RECOVERY - MariaDB Slave IO: m2 on dbstore1001 is OK: OK slave_io_state not a slave [12:17:30] RECOVERY - MariaDB Slave SQL: s6 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional) [12:17:30] RECOVERY - MariaDB Slave IO: s6 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [12:17:30] RECOVERY - MariaDB Slave SQL: s3 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes [12:17:31] RECOVERY - MariaDB Slave SQL: s5 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes [12:17:31] RECOVERY - MariaDB Slave IO: s3 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [12:17:32] RECOVERY - MariaDB Slave SQL: x1 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional) [12:17:32] RECOVERY - MariaDB Slave SQL: s7 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional) [12:17:33] RECOVERY - MariaDB Slave IO: m3 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [12:17:33] RECOVERY - MariaDB Slave IO: s4 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [12:17:34] RECOVERY - MariaDB Slave SQL: s4 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes [12:18:19] RECOVERY - MariaDB Slave IO: x1 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [12:18:29] RECOVERY - puppet last run on cp3038 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures [12:19:58] !log aude@tin Synchronized wmf-config/Wikibase-labs.php: Site links for non-main namespace wiktionary pages (duration: 
00m 44s) [12:20:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:25:15] jouncebot: refresh [12:25:17] I refreshed my knowledge about deployments. [12:25:18] jouncebot: next [12:25:18] In 0 hour(s) and 34 minute(s): European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170607T1300) [12:26:04] !log aude@tin Synchronized wmf-config/Wikibase.php: Site links for non-main namespace wiktionary pages T158323 (duration: 00m 43s) [12:26:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:26:13] T158323: enable sitelinks on Wikidata for Wiktionary pages outside main namespace (phase 1) - https://phabricator.wikimedia.org/T158323 [12:26:22] (03PS1) 10Elukey: Test new zookeeper version on conf2002 [puppet] - 10https://gerrit.wikimedia.org/r/357590 [12:26:42] moritzm: --^ [12:26:44] 10Operations, 10DBA, 10Wikimedia-Site-requests: Rename user "Mlpearc" to "FlightTime" on Central Auth - https://phabricator.wikimedia.org/T166028#3323325 (10Marostegui) @MarcoAurelio thanks for the heads up!. We will make sure we have no schema maintenance scheduled for when you decide to run it. Ping any of... 
[12:27:12] 10Operations, 10Performance-Team, 10Thumbor: Package latest version of Thumbor and deploy it - https://phabricator.wikimedia.org/T167286#3323329 (10Gilles) [12:27:14] 10Operations, 10Performance-Team, 10Thumbor, 10MW-1.30-release-notes (WMF-deploy-2017-06-06_(1.30.0-wmf.4)), 10Patch-For-Review: Thumbor should reject thumbnail requests that are the same size as the original or bigger - https://phabricator.wikimedia.org/T150741#3323342 (10Gilles) [12:27:23] (03CR) 10Aude: [C: 032] Enable Wikibase Client on beta wiktionary sites [mediawiki-config] - 10https://gerrit.wikimedia.org/r/357533 (https://phabricator.wikimedia.org/T158323) (owner: 10Aude) [12:28:07] hashar: i'd like to also deploy a hotfix https://gerrit.wikimedia.org/r/#/c/357530/ (to wmf4) [12:28:15] 10Operations, 10Performance-Team, 10Thumbor: Package latest version of Thumbor and deploy it - https://phabricator.wikimedia.org/T167286#3323357 (10Gilles) [12:28:17] 10Operations, 10Performance-Team, 10Thumbor: Backport python-schedule and add it to jessie-wikimedia - https://phabricator.wikimedia.org/T167287#3323344 (10Gilles) [12:28:22] 10Operations, 10Performance-Team, 10Thumbor: Backport python-schedule and add it to jessie-wikimedia - https://phabricator.wikimedia.org/T167287#3323344 (10Gilles) [12:28:22] maybe since i'm deploying other stuff now, can take care of it before swat [12:28:39] PROBLEM - Check Varnish expiry mailbox lag on cp1063 is CRITICAL: CRITICAL: expiry mailbox lag is 2038123 [12:28:45] aude: I am attending a tech talk right now [12:29:00] aude: but change looks good to me so please be bold!? 
[12:29:41] aude: note that wmf.4 is only on group0 for now [12:30:13] (03CR) 10Muehlenhoff: [C: 031] Test new zookeeper version on conf2002 [puppet] - 10https://gerrit.wikimedia.org/r/357590 (owner: 10Elukey) [12:30:19] yeah, the issue is on wmf4 only [12:30:34] still working on my changes for beta though [12:30:59] The .git directory is missing from extensions/Constraints/, see https://getcomposer.org/commit-deps for more information [12:31:00] :/ [12:31:20] https://gerrit.wikimedia.org/r/#/c/354522/ [12:31:39] we need to switch to composer install (at least for the wikidata build) to respect the lock file is there [12:31:54] ah [12:32:22] even that, i'm not 100% sure composer install won't have this specific issue [12:32:29] !log cp1072, cp1063 restarting varnish backend for mailbox lag [12:32:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:32:50] aude: I have noticed your patch to invoke "composer install" whenever a composer.lock is present. Guess I will have to review/test it [12:33:11] i think it works as composer update if there is no composer.lock [12:33:19] whenever you can look at it, that would be great [12:33:41] (03CR) 10Elukey: [C: 032] Test new zookeeper version on conf2002 [puppet] - 10https://gerrit.wikimedia.org/r/357590 (owner: 10Elukey) [12:34:13] (03PS2) 10Aude: Enable Wikibase Client on beta wiktionary sites [mediawiki-config] - 10https://gerrit.wikimedia.org/r/357533 (https://phabricator.wikimedia.org/T158323) [12:34:22] (03CR) 10Aude: [C: 032] Enable Wikibase Client on beta wiktionary sites [mediawiki-config] - 10https://gerrit.wikimedia.org/r/357533 (https://phabricator.wikimedia.org/T158323) (owner: 10Aude) [12:35:45] (03Merged) 10jenkins-bot: Enable Wikibase Client on beta wiktionary sites [mediawiki-config] - 10https://gerrit.wikimedia.org/r/357533 (https://phabricator.wikimedia.org/T158323) (owner: 10Aude) [12:35:53] (03CR) 10jenkins-bot: Enable Wikibase Client on beta wiktionary sites
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/357533 (https://phabricator.wikimedia.org/T158323) (owner: 10Aude) [12:38:39] RECOVERY - Check Varnish expiry mailbox lag on cp1063 is OK: OK: expiry mailbox lag is 0 [12:40:00] !log upgrade zookeeper packages on conf2002 to 3.4.5+dfsg-2+deb8u2 [12:40:06] !log aude@tin Synchronized wmf-config/InitialiseSettings-labs.php: Enable Wikibase Client on beta wiktionary sites T158323 (duration: 00m 43s) [12:40:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:40:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:40:18] T158323: enable sitelinks on Wikidata for Wiktionary pages outside main namespace (phase 1) - https://phabricator.wikimedia.org/T158323 [12:41:27] dcausse: jan_drewniak : I will not be available for SWAT sorry. [12:41:46] well I might, but I will be in some tech talk so it is going to be hard to handle it myself. [12:42:09] hashar: np, I can swat, no one from releng will be around? [12:43:19] !log Drop table updates on s2 - T139342 [12:43:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:43:30] T139342: DROP OAI-related tables - https://phabricator.wikimedia.org/T139342 [12:43:39] RECOVERY - Check Varnish expiry mailbox lag on cp1072 is OK: OK: expiry mailbox lag is 0 [12:46:54] 10Operations, 10DBA, 10Wikimedia-Site-requests: Rename user "Mlpearc" to "FlightTime" on Central Auth - https://phabricator.wikimedia.org/T166028#3323381 (10jcrespo) Same recommendations apply than T167031#3315249 plus the additional of not making both at the same time. 
[12:48:33] (03PS1) 10Ema: base::kernel: create /etc/modules-load.d on Trusty systems [puppet] - 10https://gerrit.wikimedia.org/r/357591 [12:53:16] (03PS1) 10Jcrespo: mariadb: Remove old codfw db hosts from candidates for reimage [puppet] - 10https://gerrit.wikimedia.org/r/357592 [12:55:23] (03CR) 10Jcrespo: [C: 032] mariadb: Remove old codfw db hosts from candidates for reimage [puppet] - 10https://gerrit.wikimedia.org/r/357592 (owner: 10Jcrespo) [12:56:54] !log aude@tin Synchronized php-1.30.0-wmf.4/extensions/Wikidata: Fix parser function registration T167238 (duration: 02m 20s) [12:57:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:57:04] T167238: Fatal Error: "Tag hook for noexternallanglinks is not callable" (1.30.0-wmf.4/includes/parser/Parser.php) - https://phabricator.wikimedia.org/T167238 [12:57:29] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] [12:58:21]