[00:00:04] addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Respected human, time to deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170216T0000). Please do the needful. [00:00:04] Krinkle: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [00:00:16] o/ [00:00:23] \m/ [00:00:55] doing [00:04:05] Krinkle, pulled on mwdebug1002 [00:04:15] RECOVERY - puppet last run on analytics1027 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [00:04:19] MaxSem: OK. verifying now.. [00:05:40] MaxSem: Doesn't appear to be applied. [00:06:50] MaxSem: Hm.. let me try again [00:06:55] (03PS4) 10Dzahn: redirects.dat - split non-canonical to separate section [puppet] - 10https://gerrit.wikimedia.org/r/292785 (https://phabricator.wikimedia.org/T133548) (owner: 10BBlack) [00:07:52] (03CR) 10Dzahn: [C: 031] "needed manual rebase - done" [puppet] - 10https://gerrit.wikimedia.org/r/292785 (https://phabricator.wikimedia.org/T133548) (owner: 10BBlack) [00:08:00] Krinkle, the server FS has the right version. Caching? [00:08:08] Wrong MW version? [00:08:12] Ah, need to test on group0 only [00:08:15] yeah, 1min [00:09:13] yeah, I just got a backport out for wmf.12 late in the day. Wanted to make sure errors cleared. It's now late, so wmf.12 still on group0 only. [00:09:58] thcipriani, will train finish tomorrow? [00:10:48] MaxSem: OK. Good. It's verified and works as expected. [00:10:49] MaxSem: planning on it. I'll move it forward in my morning to group1 and then push to group2 in the normal window [00:10:51] (verified on test and test2) [00:12:24] !log maxsem@tin Synchronized php-1.29.0-wmf.12/extensions/Gadgets: https://gerrit.wikimedia.org/r/#/c/338004/ (duration: 00m 42s) [00:12:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:12:29] Krinkle, ^ [00:15:10] MaxSem: Thanks [00:15:17] :) [00:15:31] thcipriani: Any issues outstanding blocking the roll out? [00:15:34] Or did we get them all [00:15:45] RECOVERY - puppet last run on db1021 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures [00:15:52] There were a few about rdbms stuff. I've merged them in master. Haven't kept track of which have/haven't been backported [00:16:23] Krinkle: no outstanding issues, backported and deployed the last one, looks like there haven't been new errors. [00:16:28] okay [00:17:18] porting the instanceof solution to wmf.12 caused it to blow up in wmf.11 when I moved it forward for one of them. Hopefully the version change in the key will ensure that doesn't happen when I roll forward in the morning.
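A minimal sketch of the SWAT backport flow behind the "pulled on mwdebug1002" and "Synchronized php-1.29.0-wmf.12/extensions/Gadgets" entries above; the exact invocations are not in the log, so the commands and paths here are illustrative only:

```
# Rough sketch of a SWAT backport deploy (simplified; assumes the usual scap subcommands).
scap pull    # run on mwdebug1002 so the tester can verify the staged change
             # (testers route their requests there, e.g. with the X-Wikimedia-Debug header)

# Once verified, sync the extension directory to the whole cluster from the
# deployment host; the message becomes the Server Admin Log entry seen above.
scap sync-dir php-1.29.0-wmf.12/extensions/Gadgets \
    'https://gerrit.wikimedia.org/r/#/c/338004/'
```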
[00:17:57] jouncebot: now [00:17:57] For the next 0 hour(s) and 42 minute(s): Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170216T0000) [00:19:14] mutante, I'm done with SWAT [00:19:35] ok :) i was just going to use mwdebug1001 to test something [00:19:43] and then revert to before [00:27:46] (03CR) 10Dzahn: [C: 031] "alright, i tried to actually test this" [puppet] - 10https://gerrit.wikimedia.org/r/292785 (https://phabricator.wikimedia.org/T133548) (owner: 10BBlack) [00:28:37] (03CR) 10Dzahn: [C: 031] "that is "testing 209 urls on 1 servers" btw" [puppet] - 10https://gerrit.wikimedia.org/r/292785 (https://phabricator.wikimedia.org/T133548) (owner: 10BBlack) [00:29:12] (03PS4) 10Dzahn: partman: delete raid1-lvm-ext4 recipe [puppet] - 10https://gerrit.wikimedia.org/r/337532 (https://phabricator.wikimedia.org/T156955) [00:33:19] (03CR) 10Dzahn: "what about "Requires=network.target". you don't use that but the "working example" has it." [puppet] - 10https://gerrit.wikimedia.org/r/338022 (owner: 10Paladox) [00:35:42] (03CR) 10Dzahn: "does "before apache" work? do both services come up after a reboot of the machine?" [puppet] - 10https://gerrit.wikimedia.org/r/338022 (owner: 10Paladox) [00:48:35] PROBLEM - puppet last run on db1079 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [00:53:33] (03PS1) 10Jcrespo: Repool db1082 with low load after crash [mediawiki-config] - 10https://gerrit.wikimedia.org/r/338037 (https://phabricator.wikimedia.org/T158188) [00:57:15] (03CR) 10Jcrespo: [C: 04-2] "See: I566e46bdbdca7fbe5 When a server crashes, its BP is not dumped properly." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/337860 (owner: 10Jcrespo) [00:57:30] (03CR) 10Jcrespo: [V: 032 C: 032] Repool db1082 with low load after crash [mediawiki-config] - 10https://gerrit.wikimedia.org/r/338037 (https://phabricator.wikimedia.org/T158188) (owner: 10Jcrespo) [00:57:49] (03CR) 10jenkins-bot: Repool db1082 with low load after crash [mediawiki-config] - 10https://gerrit.wikimedia.org/r/338037 (https://phabricator.wikimedia.org/T158188) (owner: 10Jcrespo) [00:59:16] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Repool db1082 with low load (duration: 00m 41s) [00:59:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:00:04] twentyafterfour: Dear anthropoid, the time has come. Please deploy Phabricator update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170216T0100). 
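Dzahn's question above about unit ordering ("does 'before apache' work? do both services come up after a reboot of the machine?") can be checked on a test host; a minimal sketch, assuming the Apache unit is apache2.service and my-service.service stands in for the unit under review:

```
# Minimal sketch for checking systemd ordering after a reboot; the unit names
# are assumptions, substitute the service from the change under review.
systemctl list-dependencies --before apache2.service     # units ordered before apache
systemd-analyze critical-chain apache2.service           # ordering actually observed at boot
systemctl is-active apache2.service my-service.service   # did both come up after the reboot?
```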
[01:03:55] PROBLEM - Work requests waiting in Zuul Gearman server https://grafana.wikimedia.org/dashboard/db/zuul-gearman on contint1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 0 below the confidence bounds [01:10:55] PROBLEM - Work requests waiting in Zuul Gearman server https://grafana.wikimedia.org/dashboard/db/zuul-gearman on contint1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 7 below the confidence bounds [01:13:55] RECOVERY - Work requests waiting in Zuul Gearman server https://grafana.wikimedia.org/dashboard/db/zuul-gearman on contint1001 is OK: OK: No anomaly detected [01:16:35] RECOVERY - puppet last run on db1079 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures [01:35:36] 06Operations, 10fundraising-tech-ops: set up SSL cert monitoring for benefactorevents.wm.o - https://phabricator.wikimedia.org/T156850#3031904 (10Dzahn) a:03Dzahn [01:35:41] 06Operations, 10fundraising-tech-ops: set up SSL cert monitoring for benefactorevents.wm.o - https://phabricator.wikimedia.org/T156850#2987822 (10Dzahn) 05Open>03Resolved [01:36:27] (03CR) 10Dzahn: "17:13 < icinga-wm> PROBLEM - Work requests waiting in Zuul Gearman server https://grafana.wikimedia.org/dashboard/db/zuul-gearman on conti" [puppet] - 10https://gerrit.wikimedia.org/r/337552 (https://phabricator.wikimedia.org/T70113) (owner: 10Hashar) [02:31:56] PROBLEM - Work requests waiting in Zuul Gearman server https://grafana.wikimedia.org/dashboard/db/zuul-gearman on contint1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 0 below the confidence bounds [02:33:06] !log l10nupdate@tin scap sync-l10n completed (1.29.0-wmf.11) (duration: 11m 46s) [02:33:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:37:55] RECOVERY - Work requests waiting in Zuul Gearman server https://grafana.wikimedia.org/dashboard/db/zuul-gearman on contint1001 is OK: OK: No anomaly detected [02:40:55] PROBLEM - Work requests waiting in Zuul Gearman server https://grafana.wikimedia.org/dashboard/db/zuul-gearman on contint1001 is CRITICAL: CRITICAL: Anomaly detected: 16 data above and 0 below the confidence bounds [02:42:29] o/ does anyone know if renaming a Wikimedia GitHub repo would break mirroring? I believe GitHub redirects old URL usages but I wondered if anyone knew [02:46:55] PROBLEM - Work requests waiting in Zuul Gearman server https://grafana.wikimedia.org/dashboard/db/zuul-gearman on contint1001 is CRITICAL: CRITICAL: Anomaly detected: 16 data above and 0 below the confidence bounds [03:05:19] !log l10nupdate@tin scap sync-l10n completed (1.29.0-wmf.12) (duration: 14m 27s) [03:05:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:06:25] PROBLEM - Check Varnish expiry mailbox lag on cp3040 is CRITICAL: CRITICAL: expiry mailbox lag is 28355 [03:06:55] RECOVERY - Work requests waiting in Zuul Gearman server https://grafana.wikimedia.org/dashboard/db/zuul-gearman on contint1001 is OK: OK: No anomaly detected [03:07:25] RECOVERY - Check Varnish expiry mailbox lag on cp3040 is OK: OK: expiry mailbox lag is 8 [03:11:01] !log l10nupdate@tin ResourceLoader cache refresh completed at Thu Feb 16 03:11:01 UTC 2017 (duration 5m 42s) [03:11:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:15:23] (03CR) 10Krinkle: "Yeah, if invoking clean --keep-static, we shouldn't remove the branch pointer probably. 
Rather, that would be done when later invoked anot" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/336901 (owner: 10Chad) [03:33:05] PROBLEM - All k8s worker nodes are healthy on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/k8s/nodes/ready - 185 bytes in 0.219 second response time [03:46:50] (03PS1) 10Krinkle: navtiming: Make tests easier to extend [puppet] - 10https://gerrit.wikimedia.org/r/338044 [03:47:39] (03CR) 10Krinkle: "Perhaps you'd like to rebase on I699c61e3ae20e which would make it easier to add the json objects below their ua-string equivalents. It wo" [puppet] - 10https://gerrit.wikimedia.org/r/337158 (https://phabricator.wikimedia.org/T156760) (owner: 10Nuria) [03:48:54] (03PS2) 10Krinkle: navtiming: Make tests easier to extend [puppet] - 10https://gerrit.wikimedia.org/r/338044 [03:49:09] (03PS2) 10Krinkle: Enable wgEnableWANCacheReaper in beta labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/335704 (owner: 10Aaron Schulz) [03:55:18] (03CR) 10Krinkle: [C: 032] Enable wgEnableWANCacheReaper in beta labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/335704 (owner: 10Aaron Schulz) [03:56:45] (03Merged) 10jenkins-bot: Enable wgEnableWANCacheReaper in beta labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/335704 (owner: 10Aaron Schulz) [03:56:53] (03CR) 10jenkins-bot: Enable wgEnableWANCacheReaper in beta labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/335704 (owner: 10Aaron Schulz) [03:57:55] PROBLEM - Work requests waiting in Zuul Gearman server https://grafana.wikimedia.org/dashboard/db/zuul-gearman on contint1001 is CRITICAL: CRITICAL: Anomaly detected: 14 data above and 0 below the confidence bounds [04:00:05] RECOVERY - All k8s worker nodes are healthy on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.212 second response time [04:00:55] PROBLEM - Work requests waiting in Zuul Gearman server https://grafana.wikimedia.org/dashboard/db/zuul-gearman on contint1001 is CRITICAL: CRITICAL: Anomaly detected: 14 data above and 0 below the confidence bounds [04:15:55] PROBLEM - Work requests waiting in Zuul Gearman server https://grafana.wikimedia.org/dashboard/db/zuul-gearman on contint1001 is CRITICAL: CRITICAL: Anomaly detected: 29 data above and 0 below the confidence bounds [05:03:55] PROBLEM - Work requests waiting in Zuul Gearman server https://grafana.wikimedia.org/dashboard/db/zuul-gearman on contint1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 0 below the confidence bounds [05:04:45] PROBLEM - puppet last run on db1065 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [05:11:55] PROBLEM - Work requests waiting in Zuul Gearman server https://grafana.wikimedia.org/dashboard/db/zuul-gearman on contint1001 is CRITICAL: CRITICAL: Anomaly detected: 11 data above and 0 below the confidence bounds [05:16:55] PROBLEM - Work requests waiting in Zuul Gearman server https://grafana.wikimedia.org/dashboard/db/zuul-gearman on contint1001 is CRITICAL: CRITICAL: Anomaly detected: 11 data above and 0 below the confidence bounds [05:32:45] RECOVERY - puppet last run on db1065 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [05:39:36] PROBLEM - puppet last run on db1052 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [05:50:25] PROBLEM - Disk space on labnet1001 is CRITICAL: DISK CRITICAL - free space: / 1420 MB (3% inode=93%) [06:07:35] RECOVERY - puppet last run on db1052 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [06:10:55] RECOVERY - Work requests waiting in Zuul Gearman server https://grafana.wikimedia.org/dashboard/db/zuul-gearman on contint1001 is OK: OK: No anomaly detected [06:28:55] PROBLEM - Work requests waiting in Zuul Gearman server https://grafana.wikimedia.org/dashboard/db/zuul-gearman on contint1001 is CRITICAL: CRITICAL: Anomaly detected: 13 data above and 0 below the confidence bounds [06:32:55] PROBLEM - Work requests waiting in Zuul Gearman server https://grafana.wikimedia.org/dashboard/db/zuul-gearman on contint1001 is CRITICAL: CRITICAL: Anomaly detected: 18 data above and 0 below the confidence bounds [07:10:25] PROBLEM - Disk space on labnet1001 is CRITICAL: DISK CRITICAL - free space: / 1418 MB (3% inode=93%) [07:11:55] PROBLEM - Work requests waiting in Zuul Gearman server https://grafana.wikimedia.org/dashboard/db/zuul-gearman on contint1001 is CRITICAL: CRITICAL: Anomaly detected: 28 data above and 2 below the confidence bounds [07:12:55] RECOVERY - Work requests waiting in Zuul Gearman server https://grafana.wikimedia.org/dashboard/db/zuul-gearman on contint1001 is OK: OK: No anomaly detected [07:27:41] 06Operations, 10Beta-Cluster-Infrastructure, 10MediaWiki-extensions-InterwikiSorting, 10Wikidata, and 4 others: Deploy InterwikiSorting extension to beta - https://phabricator.wikimedia.org/T155995#2960964 (10Nikerabbit) This broke the compact language links based on comment T153900#3011037. I'm submitting... [07:29:35] PROBLEM - puppet last run on mc1022 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [07:29:55] PROBLEM - Disk space on elastic1029 is CRITICAL: DISK CRITICAL - free space: /var/lib/elasticsearch 60567 MB (12% inode=99%) [07:33:51] 06Operations, 10ops-codfw, 10DBA, 13Patch-For-Review: Change rack for servers in s1 in codfw - https://phabricator.wikimedia.org/T156478#3032318 (10Marostegui) Thanks @Papaul! I will get that ready! [07:34:25] PROBLEM - Check systemd state on labstore1005 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [07:34:45] PROBLEM - Ensure mysql credential creation for tools users is running on labstore1005 is CRITICAL: CRITICAL - Expecting active but unit maintain-dbusers is failed [07:37:36] (03PS1) 10Marostegui: db-eqiad.php: Increase load db1082 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/338067 (https://phabricator.wikimedia.org/T158188) [07:39:45] RECOVERY - Ensure mysql credential creation for tools users is running on labstore1005 is OK: OK - maintain-dbusers is active [07:39:49] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Increase load db1082 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/338067 (https://phabricator.wikimedia.org/T158188) (owner: 10Marostegui) [07:40:19] (03CR) 10Muehlenhoff: [C: 04-1] "Please hold that for now. 
I'll be doing an exhaustive review of all privileged LDAP groups soon (T129788), if it's all fine I'll merge aft" [puppet] - 10https://gerrit.wikimedia.org/r/333024 (owner: 10Addshore) [07:40:25] RECOVERY - Check systemd state on labstore1005 is OK: OK - running: The system is fully operational [07:41:32] (03Merged) 10jenkins-bot: db-eqiad.php: Increase load db1082 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/338067 (https://phabricator.wikimedia.org/T158188) (owner: 10Marostegui) [07:41:35] PROBLEM - Disk space on elastic1028 is CRITICAL: DISK CRITICAL - free space: /var/lib/elasticsearch 62222 MB (12% inode=99%) [07:41:41] (03CR) 10jenkins-bot: db-eqiad.php: Increase load db1082 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/338067 (https://phabricator.wikimedia.org/T158188) (owner: 10Marostegui) [07:41:43] (03CR) 10Muehlenhoff: [C: 04-1] "Should be based on the recently merged generic solution added in 336420 (and consequently only enabled for 2001 initially)." [puppet] - 10https://gerrit.wikimedia.org/r/336408 (https://phabricator.wikimedia.org/T157429) (owner: 10Hashar) [07:43:12] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Increase load db1082 - T158188 (duration: 00m 42s) [07:43:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:43:17] T158188: db1082 MySQL crashed - https://phabricator.wikimedia.org/T158188 [07:46:31] (03PS1) 10Marostegui: Revert "db-codfw.php: Depool db2060" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/338068 [07:46:55] RECOVERY - Disk space on elastic1029 is OK: DISK OK [07:48:19] (03PS2) 10Marostegui: Revert "db-codfw.php: Depool db2060" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/338068 [07:50:40] (03CR) 10Marostegui: [C: 032] Revert "db-codfw.php: Depool db2060" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/338068 (owner: 10Marostegui) [07:51:58] (03Merged) 10jenkins-bot: Revert "db-codfw.php: Depool db2060" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/338068 (owner: 10Marostegui) [07:52:06] (03CR) 10jenkins-bot: Revert "db-codfw.php: Depool db2060" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/338068 (owner: 10Marostegui) [07:54:12] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Repool db2060 - T156161 (duration: 00m 44s) [07:54:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:54:17] T156161: db2060 not accessible - https://phabricator.wikimedia.org/T156161 [07:54:48] 06Operations, 10ops-codfw, 10DBA, 13Patch-For-Review: db2060 not accessible - https://phabricator.wikimedia.org/T156161#3032346 (10Marostegui) I have repooled the server. [07:56:35] RECOVERY - puppet last run on mc1022 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures [07:57:55] PROBLEM - Disk space on elastic1029 is CRITICAL: DISK CRITICAL - free space: /var/lib/elasticsearch 62153 MB (12% inode=99%) [08:10:04] 06Operations, 10Wikimedia-Logstash: Get 5xx logs into kibana/logstash - https://phabricator.wikimedia.org/T149451#3032362 (10fgiunchedi) >>! In T149451#2864911, @Ottomata wrote: > We could set up a special varnishkafka instance for this, if that makes sense. But, hm, I think using kafkatee would be better! k... [08:13:55] RECOVERY - Disk space on elastic1029 is OK: DISK OK [08:18:45] PROBLEM - puppet last run on db1021 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [08:19:35] RECOVERY - Disk space on elastic1028 is OK: DISK OK [08:30:41] 06Operations: Separate dc ops group in pwstore - https://phabricator.wikimedia.org/T158285#3032373 (10MoritzMuehlenhoff) [08:31:36] 06Operations, 10ops-eqiad, 10DBA: Replace BBU for db1060 - https://phabricator.wikimedia.org/T158194#3032388 (10Marostegui) @Cmjohnson were you able to find a replacement BBU in the end? Thanks! [08:32:55] 06Operations, 10ops-eqiad: Degraded RAID on db1060 - https://phabricator.wikimedia.org/T158193#3032396 (10Marostegui) 05Open>03stalled Wait for this to happen before we replace any disks on this task: https://phabricator.wikimedia.org/T158194 [08:37:42] (03PS1) 10Marostegui: db-eqiad.php: Restore db1082 original load [mediawiki-config] - 10https://gerrit.wikimedia.org/r/338078 (https://phabricator.wikimedia.org/T158188) [08:39:43] !log roll-restart jobrunner in codfw/eqiad to pick up fluorine -> mwlog1001 redis change - T123728 [08:39:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:39:48] T123728: Upgrade fluorine to trusty/jessie - https://phabricator.wikimedia.org/T123728 [08:41:36] 06Operations, 10Scap, 13Patch-For-Review, 06Release-Engineering-Team (Long-Lived-Branches): Make git 2.2.0+ (preferably 2.8.x) available - https://phabricator.wikimedia.org/T140927#3032403 (10fgiunchedi) 05Open>03Resolved a:03fgiunchedi This is completed, `use_experimental` can be removed once deploy... [08:42:17] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Restore db1082 original load [mediawiki-config] - 10https://gerrit.wikimedia.org/r/338078 (https://phabricator.wikimedia.org/T158188) (owner: 10Marostegui) [08:43:49] (03Merged) 10jenkins-bot: db-eqiad.php: Restore db1082 original load [mediawiki-config] - 10https://gerrit.wikimedia.org/r/338078 (https://phabricator.wikimedia.org/T158188) (owner: 10Marostegui) [08:44:03] (03CR) 10jenkins-bot: db-eqiad.php: Restore db1082 original load [mediawiki-config] - 10https://gerrit.wikimedia.org/r/338078 (https://phabricator.wikimedia.org/T158188) (owner: 10Marostegui) [08:44:25] PROBLEM - Check systemd state on mw2161 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [08:44:54] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Restore original db1082 weight - T158188 (duration: 00m 41s) [08:44:55] PROBLEM - Check systemd state on mw2153 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [08:44:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:44:58] T158188: db1082 MySQL crashed - https://phabricator.wikimedia.org/T158188 [08:45:04] (03PS1) 10Muehlenhoff: Add one more LDAP user [puppet] - 10https://gerrit.wikimedia.org/r/338082 [08:45:45] RECOVERY - puppet last run on db1021 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [08:46:14] (03CR) 10Muehlenhoff: [V: 032 C: 032] Add one more LDAP user [puppet] - 10https://gerrit.wikimedia.org/r/338082 (owner: 10Muehlenhoff) [08:46:25] RECOVERY - Check systemd state on mw2161 is OK: OK - running: The system is fully operational [08:47:55] RECOVERY - Check systemd state on mw2153 is OK: OK - running: The system is fully operational [08:49:08] the systemd fail was some jobrunners not restarting in the salt run in codfw, fixed and now doing eqiad [08:49:35] PROBLEM - Check systemd state on mw2249 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
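The "Synchronized wmf-config/db-eqiad.php" entries above follow the usual pattern for pooling and weight changes; a hedged sketch of the deploy step once the mediawiki-config change is merged (the log message text is illustrative):

```
# Hedged sketch of the config deploy behind the "Synchronized wmf-config/db-eqiad.php"
# log lines; run on the deployment host after the Gerrit change is merged.
cd /srv/mediawiki-staging
git pull                                   # pick up the merged mediawiki-config change
scap sync-file wmf-config/db-eqiad.php 'Restore original db1082 weight - T158188'
```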
[08:50:35] RECOVERY - Check systemd state on mw2249 is OK: OK - running: The system is fully operational [08:50:54] 06Operations, 10DBA, 13Patch-For-Review: db1082 MySQL crashed - https://phabricator.wikimedia.org/T158188#3032411 (10Marostegui) [08:50:57] 06Operations, 10DBA, 13Patch-For-Review: Investigate db1082 crash - https://phabricator.wikimedia.org/T145533#3032410 (10Marostegui) [08:51:17] 06Operations, 10DBA, 13Patch-For-Review: Investigate db1082 crash - https://phabricator.wikimedia.org/T145533#2633433 (10Marostegui) I have added the subtask of the last crash of this server, so we can have some tracking as it's been twice already. [08:52:19] 06Operations, 10DBA, 13Patch-For-Review: db1082 MySQL crashed - https://phabricator.wikimedia.org/T158188#3029269 (10Marostegui) I will close this ticket after restoring the original weight for this server. Also added a parent task, which is the first crash this server had back in September (T145533). It wi... [08:52:33] 06Operations, 10DBA, 13Patch-For-Review: db1082 MySQL crashed - https://phabricator.wikimedia.org/T158188#3032416 (10Marostegui) 05Open>03Resolved a:03Marostegui [08:52:35] 06Operations, 10DBA, 13Patch-For-Review: Investigate db1082 crash - https://phabricator.wikimedia.org/T145533#2633433 (10Marostegui) [08:52:38] 06Operations, 10ops-eqiad, 06Discovery, 06Discovery-Search, and 2 others: rack/setup/install elastic1048-1052 - https://phabricator.wikimedia.org/T155790#3032419 (10ops-monitoring-bot) Script wmf_auto_reimage was launched by gehel on neodymium.eqiad.wmnet for hosts: ``` ['elastic1051.eqiad.wmnet'] ``` The... [08:55:25] PROBLEM - Check systemd state on mw2248 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [08:55:35] PROBLEM - puppet last run on ms-be1019 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [08:55:35] PROBLEM - Check systemd state on mw2157 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [08:56:35] RECOVERY - Check systemd state on mw2157 is OK: OK - running: The system is fully operational [08:57:35] PROBLEM - Check systemd state on mw2158 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [08:58:45] PROBLEM - Check systemd state on mw2250 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [08:59:23] ah I get it, jobrunner gets broken pipe via salt it looks like [09:00:35] PROBLEM - MariaDB Slave IO: x1 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:00:35] PROBLEM - MariaDB Slave SQL: s6 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:00:35] PROBLEM - MariaDB Slave SQL: s7 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:00:35] PROBLEM - MariaDB Slave SQL: s5 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:00:35] PROBLEM - MariaDB Slave IO: s6 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:00:36] PROBLEM - MariaDB Slave IO: s7 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:00:36] PROBLEM - MariaDB Slave IO: s4 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:00:37] RECOVERY - Check systemd state on mw2158 is OK: OK - running: The system is fully operational [09:00:45] PROBLEM - MariaDB Slave SQL: m3 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[09:00:55] PROBLEM - MariaDB Slave SQL: s1 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:00:55] PROBLEM - MariaDB Slave SQL: x1 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:00:55] PROBLEM - MariaDB Slave IO: m2 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:00:55] PROBLEM - MariaDB Slave SQL: s3 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:00:55] PROBLEM - MariaDB Slave IO: s3 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:00:56] PROBLEM - MariaDB Slave SQL: s2 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:01:04] I will check that [09:01:05] PROBLEM - MariaDB Slave IO: s5 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:01:05] PROBLEM - MariaDB Slave SQL: s4 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:01:05] PROBLEM - MariaDB Slave IO: s2 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:01:05] PROBLEM - MariaDB Slave SQL: m2 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:01:05] PROBLEM - MariaDB Slave IO: m3 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:01:15] PROBLEM - MariaDB Slave IO: s1 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:01:25] RECOVERY - Check systemd state on mw2248 is OK: OK - running: The system is fully operational [09:01:45] RECOVERY - Check systemd state on mw2250 is OK: OK - running: The system is fully operational [09:02:02] marostegui: tons of show slave status from nagios [09:02:37] yep, it is kinda hang [09:02:40] and I think I know why [09:03:12] should be goodn ow [09:03:13] now [09:03:15] marostegui: also m3 replica is broken [09:03:25] RECOVERY - MariaDB Slave IO: x1 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [09:03:25] RECOVERY - MariaDB Slave SQL: s5 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes [09:03:25] RECOVERY - MariaDB Slave SQL: s7 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes [09:03:25] RECOVERY - MariaDB Slave IO: s6 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [09:03:25] RECOVERY - MariaDB Slave IO: s4 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [09:03:26] RECOVERY - MariaDB Slave IO: s7 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [09:03:26] RECOVERY - MariaDB Slave SQL: s6 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes [09:03:27] PROBLEM - Check systemd state on mw2155 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [09:03:37] nice! what did you do? 
:D [09:03:45] RECOVERY - MariaDB Slave IO: m2 on dbstore1001 is OK: OK slave_io_state not a slave [09:03:45] RECOVERY - MariaDB Slave SQL: s1 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes [09:03:45] RECOVERY - MariaDB Slave SQL: x1 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional) [09:03:45] RECOVERY - MariaDB Slave SQL: s3 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional) [09:03:45] RECOVERY - MariaDB Slave IO: s3 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [09:03:46] RECOVERY - MariaDB Slave SQL: s2 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes [09:03:53] yep, because of this: https://phabricator.wikimedia.org/T154485 [09:03:55] RECOVERY - MariaDB Slave IO: s5 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [09:03:55] RECOVERY - MariaDB Slave SQL: s4 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes [09:03:55] RECOVERY - MariaDB Slave IO: m3 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [09:03:55] RECOVERY - MariaDB Slave SQL: m2 on dbstore1001 is OK: OK slave_sql_state not a slave [09:03:55] RECOVERY - MariaDB Slave IO: s2 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [09:04:05] RECOVERY - MariaDB Slave IO: s1 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [09:04:20] well, because I killed it [09:04:23] marostegui: ok, are you taking care of m3 replica? [09:04:30] yep :) [09:04:31] thanks [09:04:42] great, just to not step on each other toes ;) [09:04:50] thank you, sir! [09:04:56] PROBLEM - Check systemd state on mw2159 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [09:05:03] no, thank you for jumping in! [09:05:55] PROBLEM - Check systemd state on mw2154 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [09:06:35] PROBLEM - Check systemd state on mw2156 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [09:12:55] PROBLEM - Check systemd state on mw2160 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [09:13:11] sorry about the spam [09:14:25] PROBLEM - Check systemd state on mw2161 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [09:14:55] PROBLEM - Check systemd state on mw2153 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [09:15:18] (03CR) 10Hashar: "Yeah it is flapping :( Posting details on T70113" [puppet] - 10https://gerrit.wikimedia.org/r/337552 (https://phabricator.wikimedia.org/T70113) (owner: 10Hashar) [09:16:35] RECOVERY - MariaDB Slave SQL: m3 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional) [09:16:35] PROBLEM - Check systemd state on mw2247 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [09:17:55] RECOVERY - Check systemd state on mw2160 is OK: OK - running: The system is fully operational [09:19:35] PROBLEM - Check systemd state on mw2249 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [09:19:45] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 21 probes of 264 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [09:21:29] godog: need some help in restarting? [09:22:12] elukey: thanks! 
I've switched to stop + start and things should be recovering soon [09:22:23] super, let me know otherwise [09:23:35] RECOVERY - puppet last run on ms-be1019 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures [09:24:45] 06Operations, 06Analytics-Kanban, 15User-Elukey: Periodic 500s from piwik.wikimedia.org - https://phabricator.wikimedia.org/T154558#3032454 (10elukey) Details from cp1058: ``` -- VCL_call BACKEND_FETCH -- VCL_return fetch -- FetchError no backend connection -- Timestamp Beresp: 148721... [09:24:45] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw is OK: OK - failed 14 probes of 264 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [09:25:35] RECOVERY - Check systemd state on mw2249 is OK: OK - running: The system is fully operational [09:25:48] (03PS1) 10Marostegui: db-codfw.php: Depool db2070 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/338083 (https://phabricator.wikimedia.org/T156478) [09:26:37] RECOVERY - Check systemd state on mw2247 is OK: OK - running: The system is fully operational [09:27:36] PROBLEM - Check systemd state on mw2158 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [09:27:56] RECOVERY - Check systemd state on mw2153 is OK: OK - running: The system is fully operational [09:27:56] RECOVERY - Check systemd state on mw2159 is OK: OK - running: The system is fully operational [09:28:56] RECOVERY - Check systemd state on mw2161 is OK: OK - running: The system is fully operational [09:29:06] PROBLEM - Check systemd state on mw2250 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [09:29:36] RECOVERY - Check systemd state on mw2156 is OK: OK - running: The system is fully operational [09:29:56] PROBLEM - puppet last run on elastic1051 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 8 minutes ago with 3 failures. Failed resources (up to 3 shown): File_line[login.defs-SYS_GID_MAX],File_line[login.defs-SYS_UID_MAX],File[/usr/share/elasticsearch/lib/json-simple.jar] [09:30:26] RECOVERY - Check systemd state on mw2155 is OK: OK - running: The system is fully operational [09:30:56] RECOVERY - Check systemd state on mw2154 is OK: OK - running: The system is fully operational [09:31:06] RECOVERY - Check systemd state on mw2250 is OK: OK - running: The system is fully operational [09:31:16] PROBLEM - Check systemd state on elastic1051 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [09:31:24] (03PS20) 10Volans: Cumin: allow connection to the targets [puppet] - 10https://gerrit.wikimedia.org/r/330436 (https://phabricator.wikimedia.org/T154588) [09:31:30] (03CR) 10Marostegui: [C: 032] db-codfw.php: Depool db2070 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/338083 (https://phabricator.wikimedia.org/T156478) (owner: 10Marostegui) [09:31:36] PROBLEM - Check systemd state on mw2162 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
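The dbstore1001 recovery above came from killing the query that the stacked-up NRPE "SHOW SLAVE STATUS" checks were waiting behind (see T154485); a rough sketch of that kind of intervention, with a made-up thread id:

```
# Rough sketch: find and kill the thread blocking the piled-up monitoring checks.
mysql -e "SHOW FULL PROCESSLIST" | grep -i 'slave status'   # identify the stuck/stacked threads
mysql -e "KILL 12345"                                        # 12345 is a placeholder thread id
```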
[09:31:36] RECOVERY - Check systemd state on mw2158 is OK: OK - running: The system is fully operational [09:32:00] 06Operations: 'systemctl restart jobrunner' broken via salt - https://phabricator.wikimedia.org/T158288#3032457 (10fgiunchedi) [09:32:16] PROBLEM - Elasticsearch HTTPS on elastic1051 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [09:32:51] (03Merged) 10jenkins-bot: db-codfw.php: Depool db2070 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/338083 (https://phabricator.wikimedia.org/T156478) (owner: 10Marostegui) [09:33:08] (03CR) 10jenkins-bot: db-codfw.php: Depool db2070 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/338083 (https://phabricator.wikimedia.org/T156478) (owner: 10Marostegui) [09:33:26] PROBLEM - Check systemd state on mw2155 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [09:33:29] godog: let's try it with cumin then ;) [09:33:56] PROBLEM - Check systemd state on mw2159 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [09:33:57] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Depool db2070 - T156478 (duration: 00m 41s) [09:34:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:34:01] T156478: Change rack for servers in s1 in codfw - https://phabricator.wikimedia.org/T156478 [09:34:07] 06Operations: 'systemctl restart jobrunner' broken via salt - https://phabricator.wikimedia.org/T158288#3032473 (10fgiunchedi) Updated https://wikitech.wikimedia.org/wiki/Service_restarts#Application_servers_.28also_image.2Fvideo_scalers_and_job_runners.29 with a disclaimer about stop/start [09:34:10] volans: sure! how do I do that? [09:34:19] eqiad is still to go [09:34:51] godog: can wait next week? not yet deployed but will be by EOW hopefully [09:35:11] what I meant was to try the specific case with cumin, to see if has the same issue or not [09:35:38] ah ok, yeah this specific roll-restart can't wait but we can try next week another one for sure [09:35:56] PROBLEM - Check systemd state on mw2154 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [09:36:09] great, thanks [09:36:36] PROBLEM - Check systemd state on mw2156 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [09:36:56] RECOVERY - Check systemd state on mw2159 is OK: OK - running: The system is fully operational [09:38:06] volans: mc1019 it then waiting for cumin to be ready \o/ [09:38:36] RECOVERY - Check systemd state on mw2162 is OK: OK - running: The system is fully operational [09:38:36] RECOVERY - Check systemd state on mw2156 is OK: OK - running: The system is fully operational [09:38:56] RECOVERY - Check systemd state on mw2154 is OK: OK - running: The system is fully operational [09:39:10] !log installing libgc security updates on trusty systems [09:39:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:39:45] (03CR) 10Giuseppe Lavagetto: Initial import with the first version (037 comments) [software/cumin] - 10https://gerrit.wikimedia.org/r/330425 (https://phabricator.wikimedia.org/T154588) (owner: 10Volans) [09:40:26] RECOVERY - Check systemd state on mw2155 is OK: OK - running: The system is fully operational [09:43:56] PROBLEM - Check systemd state on mw2161 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
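godog's "switched to stop + start" refers to driving the jobrunner restarts remotely; a sketch under assumptions — the salt grain used for targeting is made up, and the point is that an explicit stop followed by start avoided the failed unit state a plain restart was leaving behind:

```
# Sketch of the remote roll-restart; 'cluster:jobrunner' is a hypothetical grain.
salt -G 'cluster:jobrunner' service.stop  jobrunner
salt -G 'cluster:jobrunner' service.start jobrunner
```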
[09:43:56] PROBLEM - Check systemd state on mw2160 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [09:44:56] PROBLEM - Check systemd state on mw2153 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [09:45:56] RECOVERY - Check systemd state on mw2153 is OK: OK - running: The system is fully operational [09:46:36] PROBLEM - Check systemd state on mw2247 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [09:46:56] RECOVERY - Check systemd state on mw2161 is OK: OK - running: The system is fully operational [09:46:57] RECOVERY - Check systemd state on mw2160 is OK: OK - running: The system is fully operational [09:47:36] RECOVERY - Check systemd state on mw2247 is OK: OK - running: The system is fully operational [09:49:36] PROBLEM - Check systemd state on mw2249 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [09:50:27] nevermind now I get it, puppet is also trying to stop 'jobrunner', I'm looking into it [09:50:41] (03PS1) 10Marostegui: dns: Change db2070 IP [dns] - 10https://gerrit.wikimedia.org/r/338087 (https://phabricator.wikimedia.org/T156478) [09:52:06] 06Operations, 10ops-codfw, 10DBA, 13Patch-For-Review: Change rack for servers in s1 in codfw - https://phabricator.wikimedia.org/T156478#3032508 (10Marostegui) @Papaul please review the DNS changes: https://gerrit.wikimedia.org/r/#/c/338087/ [09:52:16] RECOVERY - puppet last run on elastic1051 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures [09:52:36] RECOVERY - Check systemd state on mw2249 is OK: OK - running: The system is fully operational [09:53:21] (03PS8) 10Volans: Initial import with the first version [software/cumin] - 10https://gerrit.wikimedia.org/r/330425 (https://phabricator.wikimedia.org/T154588) [09:54:02] (03PS1) 10Marostegui: db-codfw,db-eqiad.php: Change db2070 IP [mediawiki-config] - 10https://gerrit.wikimedia.org/r/338088 (https://phabricator.wikimedia.org/T156478) [09:54:20] sorry there will be a little bit more spam [09:54:21] (03CR) 10Volans: "Thanks for the replies. See inline" (031 comment) [software/cumin] - 10https://gerrit.wikimedia.org/r/330425 (https://phabricator.wikimedia.org/T154588) (owner: 10Volans) [09:54:27] (03CR) 10Marostegui: [C: 04-1] "Wait for the server to be off." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/338088 (https://phabricator.wikimedia.org/T156478) (owner: 10Marostegui) [09:55:59] 06Operations: Unclean stop of jobrunner service via puppet - https://phabricator.wikimedia.org/T158288#3032511 (10fgiunchedi) [09:57:36] PROBLEM - Check systemd state on mw2158 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [09:58:26] Icinga will keep flapping on an alarm for contint1001 : Work requests waiting in Zuul Gearman server [09:58:36] RECOVERY - Check systemd state on mw2158 is OK: OK - running: The system is fully operational [09:58:53] we have enabled yesterday night with mutante. It is a bug/unhandled corner case in the check_graphite . Will fix it this afternoon [09:59:00] details are on https://phabricator.wikimedia.org/T70113#3032514 [10:01:36] PROBLEM - Check systemd state on mw2162 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
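The flapping "Work requests waiting in Zuul Gearman server" alert mentioned above is a check_graphite check over a Graphite series (details in T70113); one way to eyeball the underlying data is Graphite's render API. A sketch only, with a placeholder metric path:

```
# Sketch: fetch the last hour of the metric the check evaluates, as JSON.
# The target path is a placeholder, not the real metric name.
curl -s 'https://graphite.wikimedia.org/render?target=PLACEHOLDER.gearman.waiting&from=-1h&format=json'
```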
[10:02:36] RECOVERY - Check systemd state on mw2162 is OK: OK - running: The system is fully operational [10:02:43] restarted --^ [10:04:26] PROBLEM - Check systemd state on mw2155 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [10:04:32] elukey: sadly that's not the problem, it is unclean 'stop' by puppet [10:04:56] PROBLEM - Check systemd state on mw2159 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [10:05:26] RECOVERY - Check systemd state on mw2155 is OK: OK - running: The system is fully operational [10:05:44] I've "fixed" it by doing systemctl reset-failed jobrunner [10:05:56] RECOVERY - Check systemd state on mw2159 is OK: OK - running: The system is fully operational [10:07:16] RECOVERY - Elasticsearch HTTPS on elastic1051 is OK: SSL OK - Certificate elastic1051.eqiad.wmnet valid until 2022-02-15 10:05:51 +0000 (expires in 1824 days) [10:07:16] RECOVERY - Check systemd state on elastic1051 is OK: OK - running: The system is fully operational [10:07:32] 06Operations: Unclean stop of jobrunner service via puppet - https://phabricator.wikimedia.org/T158288#3032525 (10fgiunchedi) The cure for the moment is to 'systemctl reset-failed jobrunner' to restore non-degraded systemd state [10:07:36] PROBLEM - Check systemd state on mw2162 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [10:07:41] reading the task [10:09:24] godog: one thing that it is not clear to me - why puppet tries to stop the jobrunner? [10:09:38] I suspect because this is codfw [10:10:16] PROBLEM - puppet last run on elastic1051 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [10:11:16] RECOVERY - puppet last run on elastic1051 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures [10:13:39] godog: ahhhh [10:13:56] PROBLEM - Check systemd state on mw2153 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [10:13:57] PROBLEM - Check systemd state on mw2160 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [10:14:56] PROBLEM - Check systemd state on mw2161 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
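The workaround noted above ("systemctl reset-failed jobrunner", also recorded on T158288) clears the failed unit state left by the unclean stop, which is what the flapping "Check systemd state" alerts are reporting; a minimal sketch:

```
# Minimal sketch: clear the failed state so the systemd-state check recovers.
systemctl status jobrunner            # unit shows as failed after the unclean stop
systemctl reset-failed jobrunner      # drop the failed state; does not start or stop anything
systemctl is-system-running           # should go back from "degraded" to "running"
```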
[10:15:16] shush [10:15:42] !log gehel@puppetmaster1001 conftool action : set/pooled=yes; selector: name=elastic10(48|49|50|51|52).codfw.wmnet [10:15:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:15:56] RECOVERY - Check systemd state on mw2153 is OK: OK - running: The system is fully operational [10:15:56] RECOVERY - Check systemd state on mw2160 is OK: OK - running: The system is fully operational [10:16:36] RECOVERY - Check systemd state on mw2162 is OK: OK - running: The system is fully operational [10:16:56] RECOVERY - Check systemd state on mw2161 is OK: OK - running: The system is fully operational [10:17:18] !log gehel@puppetmaster1001 conftool action : set/pooled=yes; selector: name=elastic10(48|49|50|51|52).eqiad.wmnet [10:17:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:18:39] !log roll-restart hhvm in eqiad to pick up fluorine -> mwlog1001 changes - T123728 [10:18:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:18:43] T123728: Upgrade fluorine to trusty/jessie - https://phabricator.wikimedia.org/T123728 [10:34:07] (03CR) 10Ema: [C: 031] Only add the Diamond collector if ISC dhcpd is used [puppet] - 10https://gerrit.wikimedia.org/r/337009 (https://phabricator.wikimedia.org/T157794) (owner: 10Muehlenhoff) [10:36:16] PROBLEM - HHVM jobrunner on mw1303 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.001 second response time [10:37:16] RECOVERY - HHVM jobrunner on mw1303 is OK: HTTP OK: HTTP/1.1 200 OK - 203 bytes in 0.001 second response time [10:40:22] 06Operations, 10DBA, 13Patch-For-Review: Decommission old coredb machines (<=db1050) - https://phabricator.wikimedia.org/T134476#3032574 (10Marostegui) [10:40:25] 06Operations, 10ops-eqiad, 10DBA, 10hardware-requests, 13Patch-For-Review: db1019: Decommission - https://phabricator.wikimedia.org/T146265#3032572 (10Marostegui) 05Open>03Resolved I believe this is done [10:46:35] 06Operations, 06Analytics-Kanban, 10Traffic, 15User-Elukey: Periodic 500s from piwik.wikimedia.org - https://phabricator.wikimedia.org/T154558#3032576 (10elukey) [10:46:46] PROBLEM - HHVM jobrunner on mw1299 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.001 second response time [10:47:46] RECOVERY - HHVM jobrunner on mw1299 is OK: HTTP OK: HTTP/1.1 200 OK - 203 bytes in 0.001 second response time [10:54:28] 06Operations, 06Analytics-Kanban, 10Traffic, 15User-Elukey: Periodic 500s from piwik.wikimedia.org - https://phabricator.wikimedia.org/T154558#3032616 (10elukey) ``` elukey@oxygen:/srv/log/webrequest$ grep piwik archive/5xx.json-20170216 | jq -r '[.http_status,.dt]| @csv' | awk -F":" '{print $1}'| sort | u... 
[10:58:36] PROBLEM - HHVM jobrunner on mw1306 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.001 second response time [10:59:36] RECOVERY - HHVM jobrunner on mw1306 is OK: HTTP OK: HTTP/1.1 200 OK - 203 bytes in 0.001 second response time [11:05:54] I don't understand opensource [11:06:23] graphite-web is a python based renderer which has implementation for bunch of functions such as sumSeries() [11:06:44] and there is another standalone project graphite-api which is python based as well and seems to just have reimplemented everything [11:07:40] err graphite-api is a fork of graphite-web bah [11:08:40] 06Operations, 10ops-eqiad, 06Discovery, 06Discovery-Search, and 2 others: rack/setup/install elastic1048-1052 - https://phabricator.wikimedia.org/T155790#3032661 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['elastic1051.eqiad.wmnet'] ``` Of which those **FAILED**: ``` set(['elastic1051.eqi... [11:08:50] (03CR) 10DCausse: Update elasticsearch module for es5 compatability (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/333969 (https://phabricator.wikimedia.org/T155578) (owner: 10EBernhardson) [11:09:04] that's correct, IIRC to have sth easier to deploy than graphite-web [11:10:36] PROBLEM - HHVM jobrunner on mw1164 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.001 second response time [11:11:36] RECOVERY - HHVM jobrunner on mw1164 is OK: HTTP OK: HTTP/1.1 200 OK - 203 bytes in 0.002 second response time [11:17:12] <_joe_> godog: why do you need to restart hhvm to make it pick up the new log destination? [11:17:32] <_joe_> is it a setting in hhvm itself? [11:19:06] _joe_: not a setting in hhvm itself, in this case it is the redis address for the profiler, looks like fluorine was still getting some redis traffic [11:19:15] to answer your question, "I don't know" [11:19:43] <_joe_> godog: uhm the profiler in fact has to do with hhvm itself [11:20:06] PROBLEM - puppet last run on ocg1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:20:46] some of the traffic did switch yesterday after I did sync-file though, some didn't [11:29:46] PROBLEM - puppet last run on db1089 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:42:49] (03PS5) 10Muehlenhoff: Add account validation script / cron job [puppet] - 10https://gerrit.wikimedia.org/r/337032 (https://phabricator.wikimedia.org/T142836) [11:49:06] PROBLEM - HHVM jobrunner on mw1300 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.001 second response time [11:49:06] RECOVERY - puppet last run on ocg1002 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures [11:50:06] RECOVERY - HHVM jobrunner on mw1300 is OK: HTTP OK: HTTP/1.1 200 OK - 203 bytes in 0.002 second response time [11:52:46] PROBLEM - HHVM jobrunner on mw1166 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.001 second response time [11:53:46] RECOVERY - HHVM jobrunner on mw1166 is OK: HTTP OK: HTTP/1.1 200 OK - 203 bytes in 0.004 second response time [11:56:46] RECOVERY - puppet last run on db1089 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures [12:01:23] (03CR) 10Filippo Giunchedi: "A couple of comments on cleanup and one nit, the rest LGTM!" 
(035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/337032 (https://phabricator.wikimedia.org/T142836) (owner: 10Muehlenhoff) [12:11:36] (03PS2) 10Ladsgroup: gerrit: Make blue buttons look like OOUI [puppet] - 10https://gerrit.wikimedia.org/r/337397 (https://phabricator.wikimedia.org/T158298) [12:11:38] (03CR) 10Ladsgroup: "@Chad: Added in the phab card" [puppet] - 10https://gerrit.wikimedia.org/r/337397 (https://phabricator.wikimedia.org/T158298) (owner: 10Ladsgroup) [12:21:11] (03PS6) 10Muehlenhoff: Add account validation script / cron job [puppet] - 10https://gerrit.wikimedia.org/r/337032 (https://phabricator.wikimedia.org/T142836) [12:25:32] (03CR) 10jerkins-bot: [V: 04-1] Add account validation script / cron job [puppet] - 10https://gerrit.wikimedia.org/r/337032 (https://phabricator.wikimedia.org/T142836) (owner: 10Muehlenhoff) [12:28:57] * moritzm shakes fist at pointless "E302 expected 2 blank lines, found 1" CI test [12:29:35] (03PS7) 10Muehlenhoff: Add account validation script / cron job [puppet] - 10https://gerrit.wikimedia.org/r/337032 (https://phabricator.wikimedia.org/T142836) [12:30:02] moritzm: why you're angry at PEP8? :-P [12:34:28] do you mind if I review it? :) [12:34:55] (03PS1) 10Hashar: check_graphite anomaly option to set minimum upper band [puppet] - 10https://gerrit.wikimedia.org/r/338095 (https://phabricator.wikimedia.org/T70113) [12:34:56] volans: "How dare you review my code!?" [12:35:10] :D [12:35:42] (03CR) 10Paladox: "I haven't tested a reboot." [puppet] - 10https://gerrit.wikimedia.org/r/338022 (owner: 10Paladox) [12:39:46] PROBLEM - HHVM jobrunner on mw1162 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.001 second response time [12:40:46] RECOVERY - HHVM jobrunner on mw1162 is OK: HTTP OK: HTTP/1.1 200 OK - 203 bytes in 0.002 second response time [12:49:59] (03CR) 10Hashar: "Added as reviewer editors of the check_graphite script. There are a few details on T70113 and a summary in the commit message." [puppet] - 10https://gerrit.wikimedia.org/r/338095 (https://phabricator.wikimedia.org/T70113) (owner: 10Hashar) [12:57:52] 06Operations, 06Labs, 10wikitech.wikimedia.org: Expand list of people who can create new Labs project - https://phabricator.wikimedia.org/T101688#3032860 (10scfc) 05Open>03Resolved >>! In T101688#1390474, @Legoktm wrote: > Do we currently have an issue with projects not being created in a timely manner?... [12:58:10] 06Operations, 06Labs, 10wikitech.wikimedia.org: Expand list of people who can create new Labs project - https://phabricator.wikimedia.org/T101688#3032862 (10scfc) 05Resolved>03declined [12:59:10] (03PS1) 10Muehlenhoff: Update to 1.1.0e [debs/openssl11] - 10https://gerrit.wikimedia.org/r/338096 [13:08:16] (03CR) 10Volans: "Nice!" (0313 comments) [puppet] - 10https://gerrit.wikimedia.org/r/337032 (https://phabricator.wikimedia.org/T142836) (owner: 10Muehlenhoff) [13:09:00] (03CR) 10Volans: "I forgot to add one :)" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/337032 (https://phabricator.wikimedia.org/T142836) (owner: 10Muehlenhoff) [13:15:19] (03CR) 10Giuseppe Lavagetto: [C: 031] "Few nitpicks on the README, but LGTM overall. Good job!" 
(034 comments) [software/cumin] - 10https://gerrit.wikimedia.org/r/330425 (https://phabricator.wikimedia.org/T154588) (owner: 10Volans) [13:16:56] PROBLEM - Check Varnish expiry mailbox lag on cp1074 is CRITICAL: CRITICAL: expiry mailbox lag is 11747 [13:19:52] !log Shutdown db2070 for maintenance - T156478 [13:19:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:19:58] T156478: Change rack for servers in s1 in codfw - https://phabricator.wikimedia.org/T156478 [13:21:36] (03PS9) 10Volans: Initial import with the first version [software/cumin] - 10https://gerrit.wikimedia.org/r/330425 (https://phabricator.wikimedia.org/T154588) [13:21:52] (03CR) 10Volans: "Nitpicks addressed ;)" [software/cumin] - 10https://gerrit.wikimedia.org/r/330425 (https://phabricator.wikimedia.org/T154588) (owner: 10Volans) [13:23:56] RECOVERY - Check Varnish expiry mailbox lag on cp1074 is OK: OK: expiry mailbox lag is 298 [13:24:18] (03CR) 10Marostegui: [C: 032] db-codfw,db-eqiad.php: Change db2070 IP [mediawiki-config] - 10https://gerrit.wikimedia.org/r/338088 (https://phabricator.wikimedia.org/T156478) (owner: 10Marostegui) [13:25:23] 06Operations, 06Analytics-Kanban, 10netops: Review ACLs for the Analytics VLAN - https://phabricator.wikimedia.org/T157435#3032910 (10elukey) After running `tcpdump ip6` on a couple of hosts I realized that the puppet agent contacts puppetmaster1001 via IPv6. I added a special term called `puppet` to `analyt... [13:25:48] (03Merged) 10jenkins-bot: db-codfw,db-eqiad.php: Change db2070 IP [mediawiki-config] - 10https://gerrit.wikimedia.org/r/338088 (https://phabricator.wikimedia.org/T156478) (owner: 10Marostegui) [13:26:36] PROBLEM - HHVM rendering on mw1266 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50371 bytes in 3.135 second response time [13:26:49] (03CR) 10jenkins-bot: db-codfw,db-eqiad.php: Change db2070 IP [mediawiki-config] - 10https://gerrit.wikimedia.org/r/338088 (https://phabricator.wikimedia.org/T156478) (owner: 10Marostegui) [13:27:31] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Change db2070 IP as it goes to another rack - T156478 (duration: 00m 56s) [13:27:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:27:35] T156478: Change rack for servers in s1 in codfw - https://phabricator.wikimedia.org/T156478 [13:27:36] RECOVERY - HHVM rendering on mw1266 is OK: HTTP OK: HTTP/1.1 200 OK - 72624 bytes in 0.093 second response time [13:28:24] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Change db2070 IP as it goes to another rack - T156478 (duration: 00m 41s) [13:28:27] 06Operations, 06Labs, 10wikitech.wikimedia.org: wikitech regularly looses session directly after login - https://phabricator.wikimedia.org/T118395#3032915 (10scfc) 05Open>03Invalid I cannot reproduce this. Please reopen if the problem reoccurs. [13:28:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:29:49] 06Operations, 10ops-codfw, 10DBA, 13Patch-For-Review: Change rack for servers in s1 in codfw - https://phabricator.wikimedia.org/T156478#3032918 (10Marostegui) @Papaul db2070 off, mediawiki files changed with its new IP. If you review the DNS patch I will push it too. 
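The T157435 comment above mentions confirming with `tcpdump ip6` that the puppet agent reaches puppetmaster1001 over IPv6; a minimal sketch of that check (the interface name and port are assumptions about the host):

```
# Minimal sketch of the IPv6 check described in the task comment; eth0 and
# port 8140 (puppet) are assumptions about the host's configuration.
tcpdump -ni eth0 'ip6 and tcp port 8140'
```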
[13:34:46] PROBLEM - HHVM jobrunner on mw1305 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.001 second response time [13:35:46] RECOVERY - HHVM jobrunner on mw1305 is OK: HTTP OK: HTTP/1.1 200 OK - 203 bytes in 0.001 second response time [13:39:36] PROBLEM - HHVM jobrunner on mw1304 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.001 second response time [13:40:36] RECOVERY - HHVM jobrunner on mw1304 is OK: HTTP OK: HTTP/1.1 200 OK - 203 bytes in 0.001 second response time [13:42:55] (03PS8) 10Muehlenhoff: Add account validation script / cron job [puppet] - 10https://gerrit.wikimedia.org/r/337032 (https://phabricator.wikimedia.org/T142836) [13:43:56] (03CR) 10jerkins-bot: [V: 04-1] Add account validation script / cron job [puppet] - 10https://gerrit.wikimedia.org/r/337032 (https://phabricator.wikimedia.org/T142836) (owner: 10Muehlenhoff) [13:49:50] (03PS9) 10Muehlenhoff: Add account validation script / cron job [puppet] - 10https://gerrit.wikimedia.org/r/337032 (https://phabricator.wikimedia.org/T142836) [14:00:04] addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Dear anthropoid, the time has come. Please deploy European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170216T1400). [14:01:18] nothing for the swat :) [14:04:32] 06Operations, 10ops-eqiad, 13Patch-For-Review: Degraded RAID on relforge1001 - https://phabricator.wikimedia.org/T156663#3032969 (10Gehel) Relforge1001 is being drained right now, it should be ready in a few hours. Do you need to shut it down? Or is it a hot plug switch? In any case, just ping me before doin... [14:12:26] 06Operations, 05DC-Switchover-Prep-Q3-2016-17, 07Epic, 13Patch-For-Review, 07Wikimedia-Multiple-active-datacenters: Check the size of every cluster in codfw to see if it matches eqiad's capacity - https://phabricator.wikimedia.org/T156023#3033014 (10elukey) Done a quick check to see how much the mw2* hos... 
[14:17:45] (03PS2) 10Hashar: Support Jenkins install from 'experimental' component [puppet] - 10https://gerrit.wikimedia.org/r/336408 (https://phabricator.wikimedia.org/T157429) [14:18:02] (03CR) 10Hashar: "Done and rebased :)" [puppet] - 10https://gerrit.wikimedia.org/r/336408 (https://phabricator.wikimedia.org/T157429) (owner: 10Hashar) [14:20:56] PROBLEM - Check Varnish expiry mailbox lag on cp1074 is CRITICAL: CRITICAL: expiry mailbox lag is 21042 [14:22:56] RECOVERY - Check Varnish expiry mailbox lag on cp1074 is OK: OK: expiry mailbox lag is 0 [14:26:46] (03PS1) 10Elukey: Move codfw appserver conftool-data to codfw.yaml [puppet] - 10https://gerrit.wikimedia.org/r/338108 (https://phabricator.wikimedia.org/T156023) [14:27:04] !log uploaded openssl 1.1.0e to apt.wikimedia.org [14:27:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:27:18] (03CR) 10Muehlenhoff: [C: 032] Update to 1.1.0e [debs/openssl11] - 10https://gerrit.wikimedia.org/r/338096 (owner: 10Muehlenhoff) [14:28:24] hashar: this was an easy swat ;) [14:29:41] * Nemo_bis is always available to propose fillers for any swat which felt too empty [14:29:43] (03PS2) 10Hashar: contint: remove /srv/ssd [puppet] - 10https://gerrit.wikimedia.org/r/337286 [14:33:10] Nemo_bis: i got enough with my own patches :D [14:35:14] (03PS4) 10Hashar: labstore: check should search for exact mount match [puppet] - 10https://gerrit.wikimedia.org/r/333230 (https://phabricator.wikimedia.org/T155820) [14:35:52] (03CR) 10Hashar: [C: 031] "This has been cherry picked on the CI master for close to a month and fix the issue at end." [puppet] - 10https://gerrit.wikimedia.org/r/333230 (https://phabricator.wikimedia.org/T155820) (owner: 10Hashar) [14:36:24] (03PS4) 10Hashar: Gemfile: add xmlrpc for ruby 2.4 [puppet] - 10https://gerrit.wikimedia.org/r/332981 [14:38:21] (03Abandoned) 10Hashar: (WIP) zuul-merger instances (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/336803 (owner: 10Hashar) [14:39:48] (03PS2) 10Hashar: zuul: use a proper require for the merger class [puppet] - 10https://gerrit.wikimedia.org/r/337008 [14:40:05] (03CR) 10Hashar: [C: 031] "rebased/cherry picked to tip of production" [puppet] - 10https://gerrit.wikimedia.org/r/337008 (owner: 10Hashar) [14:44:26] (03PS4) 10Muehlenhoff: Only add the Diamond collector if ISC dhcpd is used [puppet] - 10https://gerrit.wikimedia.org/r/337009 (https://phabricator.wikimedia.org/T157794) [14:45:46] PROBLEM - puppet last run on kafka1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:45:52] moritzm: dhcpd? :) [14:46:01] either I am having a stroke or you are :P [14:46:40] oh, all those legacy ISC code bases sound alike :) will amend the commit message [14:46:50] heheh [14:47:40] moritzm: also in the comment in timesyncd.pp [14:48:40] thanks, fixed [14:48:47] (03PS5) 10Muehlenhoff: Only add the Diamond collector if ISC ntpd is used [puppet] - 10https://gerrit.wikimedia.org/r/337009 (https://phabricator.wikimedia.org/T157794) [14:52:26] (03CR) 10Volans: [C: 032] "Thanks everyone for the reviews, comments and feedbacks, really appreciated given the size of it in a single change!" 
[software/cumin] - 10https://gerrit.wikimedia.org/r/330425 (https://phabricator.wikimedia.org/T154588) (owner: 10Volans) [14:53:18] (03Merged) 10jenkins-bot: Initial import with the first version [software/cumin] - 10https://gerrit.wikimedia.org/r/330425 (https://phabricator.wikimedia.org/T154588) (owner: 10Volans) [14:54:36] PROBLEM - puppet last run on mc1016 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:58:07] 06Operations, 10ops-eqiad, 13Patch-For-Review: Degraded RAID on relforge1001 - https://phabricator.wikimedia.org/T156663#3033111 (10Cmjohnson) it's a hot swap disk. I will update the task once it swapped so you can rebuild the raid. [14:58:09] (03CR) 10Muehlenhoff: [C: 032] Only add the Diamond collector if ISC ntpd is used [puppet] - 10https://gerrit.wikimedia.org/r/337009 (https://phabricator.wikimedia.org/T157794) (owner: 10Muehlenhoff) [15:00:11] (03PS1) 10Filippo Giunchedi: udp2log: mirror traffic via udpmirror.py [puppet] - 10https://gerrit.wikimedia.org/r/338119 (https://phabricator.wikimedia.org/T123728) [15:01:38] 06Operations, 10ops-eqiad, 13Patch-For-Review: Degraded RAID on relforge1001 - https://phabricator.wikimedia.org/T156663#3033142 (10Gehel) I'll actually just reimage the machine (it is due for a reimage), but same result. [15:02:35] 06Operations, 13Patch-For-Review: Evaluate use of systemd-timesyncd on jessie for clock synchronisation - https://phabricator.wikimedia.org/T150257#3033156 (10MoritzMuehlenhoff) With the merge of https://gerrit.wikimedia.org/r/#/c/337009/ the installation of ISC ntpd is now prevented on stretch. [15:13:47] RECOVERY - puppet last run on kafka1001 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures [15:15:32] moritzm: oops, I had some comments [15:15:34] I'll post them anyway [15:15:39] (03CR) 10Faidon Liambotis: Only add the Diamond collector if ISC ntpd is used (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/337009 (https://phabricator.wikimedia.org/T157794) (owner: 10Muehlenhoff) [15:17:08] moritzm: also in addition to those comments and semi-relatedly to your change... we'll need to change our ntp *server* classes to also disable timesyncd [15:17:51] I think at this point the thing we should do is rethink all of it a little bit -- perhaps add an "ensure" parameter to all of the ntp client, ntp server and timesyncd classes [15:18:18] that would do the right thing (enable or disable ntp and systemd-timesyncd, add or remove the diamond collector, add or remove the monitoring check etc.) 
[15:18:46] and then say class { 'ntp::server': ensure => present } class { 'timesyncd': ensure => absent } [15:18:53] I can give it a stab at some point [15:19:28] I need to look at what the collector does on servers, not sure [15:20:01] I can address your comments in a followup patch later on, first need to proceed with the hhvm upload [15:20:14] yes, not urgent [15:21:30] (03PS1) 10Ema: varnish: tune check_varnish_expiry_mailbox_lag alerting thresholds [puppet] - 10https://gerrit.wikimedia.org/r/338123 (https://phabricator.wikimedia.org/T145661) [15:22:36] RECOVERY - puppet last run on mc1016 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures [15:24:46] (03CR) 10BBlack: [C: 031] varnish: tune check_varnish_expiry_mailbox_lag alerting thresholds [puppet] - 10https://gerrit.wikimedia.org/r/338123 (https://phabricator.wikimedia.org/T145661) (owner: 10Ema) [15:26:26] PROBLEM - Disk space on elastic1024 is CRITICAL: DISK CRITICAL - free space: /var/lib/elasticsearch 63808 MB (12% inode=99%) [15:27:47] ^ should be transient, a full reindex is in progress [15:28:59] (03PS1) 10Volans: TravisCI: force dependency upgrade [software/cumin] - 10https://gerrit.wikimedia.org/r/338125 (https://phabricator.wikimedia.org/T154588) [15:29:13] (03CR) 10Ema: [C: 032] varnish: tune check_varnish_expiry_mailbox_lag alerting thresholds [puppet] - 10https://gerrit.wikimedia.org/r/338123 (https://phabricator.wikimedia.org/T145661) (owner: 10Ema) [15:31:46] (03CR) 10Volans: [C: 032] TravisCI: force dependency upgrade [software/cumin] - 10https://gerrit.wikimedia.org/r/338125 (https://phabricator.wikimedia.org/T154588) (owner: 10Volans) [15:32:24] (03Merged) 10jenkins-bot: TravisCI: force dependency upgrade [software/cumin] - 10https://gerrit.wikimedia.org/r/338125 (https://phabricator.wikimedia.org/T154588) (owner: 10Volans) [15:34:07] (03PS2) 10Filippo Giunchedi: udp2log: mirror traffic via udpmirror.py [puppet] - 10https://gerrit.wikimedia.org/r/338119 (https://phabricator.wikimedia.org/T123728) [15:37:55] (03PS1) 10Volans: Update TravisCI and Coveralls URLs [software/cumin] - 10https://gerrit.wikimedia.org/r/338127 (https://phabricator.wikimedia.org/T154588) [15:39:20] (03CR) 10Volans: [C: 032] Update TravisCI and Coveralls URLs [software/cumin] - 10https://gerrit.wikimedia.org/r/338127 (https://phabricator.wikimedia.org/T154588) (owner: 10Volans) [15:40:10] (03Merged) 10jenkins-bot: Update TravisCI and Coveralls URLs [software/cumin] - 10https://gerrit.wikimedia.org/r/338127 (https://phabricator.wikimedia.org/T154588) (owner: 10Volans) [15:40:31] PROBLEM - Disk space on elastic1024 is CRITICAL: DISK CRITICAL - free space: /var/lib/elasticsearch 64248 MB (12% inode=99%) [15:44:10] ACKNOWLEDGEMENT - Disk space on elastic1024 is CRITICAL: DISK CRITICAL - free space: /var/lib/elasticsearch 57770 MB (11% inode=99%): Gehel lots of reindex going on, shards are already leaving elastic1024, situation should be back to normal soon - The acknowledgement expires at: 2017-02-17 20:43:31. [15:44:46] marostegui: hello are you ready for me? [15:47:13] papaul: hi! [15:47:36] papaul: yes, the server is off, so you can move it now if you like, if you don't mind reviewing the dns patch, I can get it deployed too now [15:48:30] (03PS1) 10Urbanecm: [throttle] New rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/338128 (https://phabricator.wikimedia.org/T158312) [15:48:52] PROBLEM - puppet last run on elastic1041 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [15:50:30] (03PS3) 10Filippo Giunchedi: udp2log: mirror traffic via udpmirror.py [puppet] - 10https://gerrit.wikimedia.org/r/338119 (https://phabricator.wikimedia.org/T123728) [15:51:31] (03CR) 10jerkins-bot: [V: 04-1] udp2log: mirror traffic via udpmirror.py [puppet] - 10https://gerrit.wikimedia.org/r/338119 (https://phabricator.wikimedia.org/T123728) (owner: 10Filippo Giunchedi) [15:53:09] !log upgrading mwdebug1001 to HHVM 3.12.14 [15:53:10] (03PS1) 10Reedy: Remove empty conditionals for wikis from flaggedrevs.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/338129 [15:53:12] (03PS1) 10Reedy: Add a few newlines to standardise spacing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/338130 [15:53:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:53:37] (03CR) 10Papaul: [C: 032] dns: Change db2070 IP [dns] - 10https://gerrit.wikimedia.org/r/338087 (https://phabricator.wikimedia.org/T156478) (owner: 10Marostegui) [15:54:04] papaul: you deploy or I do it? [15:54:48] 06Operations, 10ops-codfw, 10hardware-requests, 13Patch-For-Review, 15User-Elukey: Reclaim/Decommission old codfw mc2001->mc2016 hosts - https://phabricator.wikimedia.org/T157675#3033317 (10elukey) [15:56:55] (03PS4) 10Filippo Giunchedi: udp2log: mirror traffic via udpmirror.py [puppet] - 10https://gerrit.wikimedia.org/r/338119 (https://phabricator.wikimedia.org/T123728) [16:00:08] (03PS1) 10Filippo Giunchedi: Revert "hieradata: temporarily remove prometheus100[34] from prometheus_hosts" [puppet] - 10https://gerrit.wikimedia.org/r/338131 (https://phabricator.wikimedia.org/T152504) [16:01:11] !log upgrading mwdebug1002 to HHVM 3.12.14 [16:01:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:02:05] (03CR) 10Filippo Giunchedi: [C: 032] Revert "hieradata: temporarily remove prometheus100[34] from prometheus_hosts" [puppet] - 10https://gerrit.wikimedia.org/r/338131 (https://phabricator.wikimedia.org/T152504) (owner: 10Filippo Giunchedi) [16:05:43] (03CR) 10Jcrespo: [C: 032] Increase the concurrent threads of large mariadb servers [puppet] - 10https://gerrit.wikimedia.org/r/337840 (https://phabricator.wikimedia.org/T150474) (owner: 10Jcrespo) [16:06:15] (03CR) 10Jcrespo: [C: 032] "https://puppet-compiler.wmflabs.org/5493/ Will deploy in a hot way, slowly, in number order." [puppet] - 10https://gerrit.wikimedia.org/r/337840 (https://phabricator.wikimedia.org/T150474) (owner: 10Jcrespo) [16:09:04] (03PS3) 10Jcrespo: Increase the concurrent threads of large mariadb servers [puppet] - 10https://gerrit.wikimedia.org/r/337840 (https://phabricator.wikimedia.org/T150474) [16:09:27] (03CR) 10Jcrespo: [C: 032] Increase the concurrent threads of large mariadb servers [puppet] - 10https://gerrit.wikimedia.org/r/337840 (https://phabricator.wikimedia.org/T150474) (owner: 10Jcrespo) [16:09:45] PROBLEM - puppet last run on mw1182 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [16:15:07] 06Operations, 10ops-codfw, 10DBA: codfw: switch ports clean up - https://phabricator.wikimedia.org/T158246#3033362 (10Papaul) [16:15:55] RECOVERY - puppet last run on elastic1041 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures [16:24:32] (03CR) 10Dzahn: [C: 04-1] "-1 from ema per the regex not covering up to 2099" [puppet] - 10https://gerrit.wikimedia.org/r/337893 (https://phabricator.wikimedia.org/T152882) (owner: 10Dzahn) [16:25:58] !log SET GLOBAL thread_pool_size=64; on db1074's mariadb [16:26:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:26:04] 06Operations, 10ops-codfw, 10DBA: codfw: switch ports clean up - https://phabricator.wikimedia.org/T158246#3033445 (10Marostegui) Hey @RobH To clarify things, db2070 has been moved from row D to row C (as @Papaul updated on the original task description). Thanks for helping out! [16:26:42] (03PS2) 10Dzahn: adjust wikimania regex for mobile hosts, cover 2002-2019 [puppet] - 10https://gerrit.wikimedia.org/r/337893 (https://phabricator.wikimedia.org/T152882) [16:27:45] (03PS3) 10Dzahn: adjust wikimania regex for mobile hosts, cover 2002-2099 [puppet] - 10https://gerrit.wikimedia.org/r/337893 (https://phabricator.wikimedia.org/T152882) [16:28:07] (03PS3) 10Dzahn: zuul: use a proper require for the merger class [puppet] - 10https://gerrit.wikimedia.org/r/337008 (owner: 10Hashar) [16:28:57] (03PS4) 10Jcrespo: phabricator database: Move templates to the role [puppet] - 10https://gerrit.wikimedia.org/r/337827 [16:30:23] !log uploaded HHVM 3.12.14 to apt.wikimedia.org [16:30:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:31:45] PROBLEM - puppet last run on analytics1034 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:33:39] (03CR) 10Jcrespo: [C: 032] "https://puppet-compiler.wmflabs.org/5494/" [puppet] - 10https://gerrit.wikimedia.org/r/337827 (owner: 10Jcrespo) [16:35:27] (03CR) 10Dzahn: [C: 032] zuul: use a proper require for the merger class [puppet] - 10https://gerrit.wikimedia.org/r/337008 (owner: 10Hashar) [16:37:10] (03PS5) 10Jcrespo: phabricator database: Move templates to the role [puppet] - 10https://gerrit.wikimedia.org/r/337827 [16:38:01] 06Operations, 10ops-codfw, 10DBA: codfw: switch ports clean up - https://phabricator.wikimedia.org/T158246#3033481 (10RobH) [16:38:24] (03CR) 10Filippo Giunchedi: tlsproxy: add nginx_bootstrap define (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/333247 (owner: 10Filippo Giunchedi) [16:38:25] 06Operations, 10ops-codfw, 10DBA: codfw: switch ports clean up - https://phabricator.wikimedia.org/T158246#3031005 (10RobH) [16:38:40] 06Operations, 10ops-codfw, 10DBA, 13Patch-For-Review: Change rack for servers in s1 in codfw - https://phabricator.wikimedia.org/T156478#3033483 (10Marostegui) db2070: - DNS updated - network/interfaces changed - mediawiki files changed - MySQL up and replication up Pending: port configuration Once the... [16:38:49] RECOVERY - puppet last run on mw1182 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures [16:39:24] 06Operations, 10ops-codfw, 10DBA: codfw: switch ports clean up - https://phabricator.wikimedia.org/T158246#3031005 (10RobH) Ok, the new port is setup in row c. Please assign this back to me once db2070 is moved! [16:39:39] PROBLEM - puppet last run on cp4001 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [16:40:32] 06Operations, 10ops-codfw, 10DBA, 13Patch-For-Review: Change rack for servers in s1 in codfw - https://phabricator.wikimedia.org/T156478#3033489 (10Marostegui) Oh, I saw that @RobH already changed the port and the server is replicating fine! :) [16:40:53] (03PS4) 10Filippo Giunchedi: tlsproxy: add nginx_bootstrap define [puppet] - 10https://gerrit.wikimedia.org/r/333247 [16:40:55] (03PS11) 10Filippo Giunchedi: swift: terminate https with nginx [puppet] - 10https://gerrit.wikimedia.org/r/310549 (https://phabricator.wikimedia.org/T127455) [16:40:57] (03PS2) 10Jcrespo: Remove the templates dir, not needed anymore [puppet] - 10https://gerrit.wikimedia.org/r/337837 [16:41:01] 06Operations, 10ops-codfw, 10DBA: codfw: switch ports clean up - https://phabricator.wikimedia.org/T158246#3033504 (10Marostegui) a:05Papaul>03RobH The server has been already moved to row C [16:41:32] 06Operations, 10ops-codfw, 10DBA, 13Patch-For-Review: Change rack for servers in s1 in codfw - https://phabricator.wikimedia.org/T156478#3033509 (10Marostegui) a:03Marostegui Claiming this task to do the last checks, repool the server etc before closing it. [16:41:41] (03CR) 10Filippo Giunchedi: swift: terminate https with nginx (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/310549 (https://phabricator.wikimedia.org/T127455) (owner: 10Filippo Giunchedi) [16:42:48] 06Operations, 10ops-codfw, 10DBA: codfw: switch ports clean up - https://phabricator.wikimedia.org/T158246#3033512 (10RobH) >>! In T158246#3033504, @Marostegui wrote: > The server has been already moved to row C When? I just setup (as in when I put in my comment) that the port wasn't allocated or enabled,... [16:43:00] (03CR) 10Jcrespo: [C: 031] "This is ready to deploy, no blockers. This should fix the error: "Warning: Setting templatedir is deprecated. See http://links.puppetlabs." [puppet] - 10https://gerrit.wikimedia.org/r/337837 (owner: 10Jcrespo) [16:44:53] 06Operations, 10ops-codfw, 10DBA, 13Patch-For-Review: Change rack for servers in s1 in codfw - https://phabricator.wikimedia.org/T156478#3033543 (10RobH) [16:44:56] 06Operations, 10ops-codfw, 10DBA: codfw: switch ports clean up - https://phabricator.wikimedia.org/T158246#3033541 (10RobH) 05Open>03Resolved [16:45:26] \o/ ^ [16:49:19] (03CR) 10Jcrespo: [C: 031] admin: basic .vimrc for hashar [puppet] - 10https://gerrit.wikimedia.org/r/337014 (owner: 10Hashar) [16:50:18] (03CR) 10Ema: [C: 031] adjust wikimania regex for mobile hosts, cover 2002-2099 [puppet] - 10https://gerrit.wikimedia.org/r/337893 (https://phabricator.wikimedia.org/T152882) (owner: 10Dzahn) [16:52:39] (03PS1) 10Filippo Giunchedi: scap: upgrade to 3.5.2-1 [puppet] - 10https://gerrit.wikimedia.org/r/338138 (https://phabricator.wikimedia.org/T127762) [16:54:14] I can be around for the first 15min of puppet swat, anyone else? [16:54:34] (03CR) 10Jcrespo: [C: 031] Gemfile: add xmlrpc for ruby 2.4 [puppet] - 10https://gerrit.wikimedia.org/r/332981 (owner: 10Hashar) [16:55:03] I am +1 the ones I can deploy [16:55:12] ^godog [16:55:25] nice, thanks jynus [17:00:04] godog, moritzm, and _joe_: Respected human, time to deploy Puppet SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170216T1700). Please do the needful. [17:00:04] hashar: A patch you scheduled for Puppet SWAT(Max 8 patches) is about to be deployed. Please be available during the process. 
[17:00:11] o / [17:00:16] (03CR) 10Jcrespo: [C: 031] systemd: allow isequal to match programname in/rsyslog [puppet] - 10https://gerrit.wikimedia.org/r/337411 (owner: 10Hashar) [17:00:17] but I am only there for a few :( [17:00:35] there is one I do not want to deploy alone [17:00:40] RECOVERY - puppet last run on analytics1034 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures [17:00:42] the one changing systemd [17:00:49] not systemd [17:01:04] yeah I guess it can cause random issue eventually. I only tested it via rspec/cataog compilation [17:01:05] the one declaring initv [17:01:11] might want a whole run of the puppet compiler [17:01:19] oh [17:01:30] RECOVERY - Disk space on elastic1024 is OK: DISK OK [17:01:39] https://gerrit.wikimedia.org/r/#/c/336978/ contint: git-daemon service is 'sysvinit' [17:01:48] need help to test it live [17:01:49] found that one when we provisioned a new zuul::merger on contint2001 [17:01:55] the service did not come up [17:02:00] can we do it now? [17:02:09] sure [17:02:18] let's start with that one [17:02:23] the others are mostly trivial [17:02:34] on the first puppet run the service was not started and systemd was showing up as active (exited) https://phabricator.wikimedia.org/T157785 [17:02:46] we can stop puppet on contint1001 [17:02:47] merge [17:02:48] basically, if it kills conting, you can help me [17:02:51] run puppet on contint2001 [17:02:54] and see what happens [17:03:01] that is ok to me [17:03:15] I am ok to merge it directly [17:03:23] as long as you are on the machine [17:03:28] checking it [17:03:34] and restaring it, etc. [17:03:36] !log stopped puppet on contint1001 for https://gerrit.wikimedia.org/r/#/c/336978/ [17:03:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:03:46] ready [17:03:53] I am on both [17:03:54] (03PS2) 10Jcrespo: contint: git-daemon service is 'sysvinit' [puppet] - 10https://gerrit.wikimedia.org/r/336978 (https://phabricator.wikimedia.org/T157785) (owner: 10Hashar) [17:04:10] (03CR) 10Jcrespo: [V: 032 C: 032] contint: git-daemon service is 'sysvinit' [puppet] - 10https://gerrit.wikimedia.org/r/336978 (https://phabricator.wikimedia.org/T157785) (owner: 10Hashar) [17:04:29] will run puppet on contint2001, check zuul-merger is still happily managed by systemd [17:05:29] Invalid service provider 'sysvinit' [17:05:32] ... [17:05:40] really? [17:05:44] do I revert? 
[17:05:47] Error: Failed to apply catalog: Parameter provider failed on Service[git-daemon]: Invalid service provider 'sysvinit' at /etc/puppet/modules/contint/manifests/zuul/git_daemon.pp:32 [17:05:53] (03PS1) 10Jcrespo: Revert "contint: git-daemon service is 'sysvinit'" [puppet] - 10https://gerrit.wikimedia.org/r/338140 [17:06:02] guess I used the wrong doc bah :( [17:06:06] (03CR) 10Jcrespo: [V: 032 C: 032] Revert "contint: git-daemon service is 'sysvinit'" [puppet] - 10https://gerrit.wikimedia.org/r/338140 (owner: 10Jcrespo) [17:06:38] well, my gut feeling was good [17:06:39] it seems [17:06:40] guess I will redo it later on sorry [17:06:42] :-) [17:07:01] at least the service is still running [17:07:16] I wanted to be here, it is not dangerous [17:07:20] but you know [17:07:40] RECOVERY - puppet last run on cp4001 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures [17:07:41] let me see if there was some other non-trivial [17:08:02] !log reenable puppet on contint1001 [17:08:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:08:33] I am not very involved with 331239 [17:08:40] puppet ok on both hosts and zuul-merger are running [17:08:43] I will deploy it, but can be tested? [17:09:01] yeah rebase it [17:09:02] right away? [17:09:11] if CI job rake-jessie says SUCCESS [17:09:13] it is fine to merge :) [17:09:14] (03PS11) 10Jcrespo: puppet parse validate from rake [puppet] - 10https://gerrit.wikimedia.org/r/331239 (https://phabricator.wikimedia.org/T154915) (owner: 10Hashar) [17:09:31] oh, it includes its own modifications? [17:09:32] the idea is to run puppet parser validate / hiera syntax check and erb templates from rake [17:09:37] I didn't know that [17:09:39] so one can locally just rake syntax [17:09:50] and happen to run locally exactly what CI does [17:10:09] lets wait for that [17:10:17] let me see what else we have [17:10:22] that also has the side effect of letting me remove the Jenkins jobs pplint-HEAD and erblint-HEAD that are something like find . -name*.pp | xargs puppet parser validate [17:10:33] that is slow, does not support ignores and not reproducible locally [17:11:08] (03CR) 10Jcrespo: [C: 032] puppet parse validate from rake [puppet] - 10https://gerrit.wikimedia.org/r/331239 (https://phabricator.wikimedia.org/T154915) (owner: 10Hashar) [17:11:13] \O/ [17:11:14] ok [17:11:31] as you own CI, you break it, you fix it! [17:11:34] :-) [17:11:39] I will update https://wikitech.wikimedia.org/wiki/Puppet_coding later tonight :} [17:11:44] thanks [17:12:01] oh CI is more like: folks use it. Sometime abuse it and we try to fix it up [17:12:07] he he [17:12:11] no [17:12:13] 80% of the maintenance is done by ops via puppet anyway [17:12:14] I am ok with that [17:12:17] I am more like [17:12:26] "syntax error on the new function" [17:12:42] I am not worried about new rule is too strict [17:13:04] should be fine. Our puppet manifests are reasonably nice nowadays [17:13:12] the last man standing is the evil import realm.pp [17:13:19] look, it is my job to be pesimistic :-) [17:13:35] specially with codebase I do not normaly touch [17:13:42] understandable [17:13:47] I have deployed now [17:15:21] will be fine [17:15:57] I will chose one server to test rsyslog.conf.erb [17:16:46] it's already live? [17:16:58] nope [17:17:08] I am seeing all use them, right? 
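Back on the "Invalid service provider 'sysvinit'" failure pasted at the top of this exchange: Puppet's service type only accepts provider names it actually ships ('debian', 'init', 'systemd', 'upstart' and friends), so naming a non-existent provider aborts the whole catalog run rather than just mis-managing one service. A hedged sketch of the shape a corrected resource could take; this is illustrative only, not the follow-up patch hashar says he will redo:

    # Illustrative only -- pin the service to a provider Puppet knows about,
    # here the classic Debian init-script provider rather than 'sysvinit'.
    service { 'git-daemon':
        ensure   => running,
        enable   => true,
        provider => 'debian',
    }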
[17:17:16] mw, eventlogging [17:17:27] feel free to skip that one [17:17:34] if it is has too much potential impact [17:17:35] no [17:17:50] we can just do an escalated deploy [17:17:58] maybe leave it for later [17:18:39] let's do the "easy ones" [17:18:53] can use that one https://gerrit.wikimedia.org/r/#/c/337289/ [17:18:59] changes the jenkins default file [17:19:27] (03PS5) 10Jcrespo: jenkins: sync default file with upstream 1.651.3 [puppet] - 10https://gerrit.wikimedia.org/r/337289 (owner: 10Hashar) [17:19:30] yes, that was easy [17:19:39] I left some extended comment on https://gerrit.wikimedia.org/r/#/c/337289/1/modules/jenkins/files/etc_default_jenkins [17:20:01] a gotcha is I set: PREFIX=/$NAME that would make the web service to use /jenkins/ as base path [17:20:13] but that setting is not passed to the command line; it is hardcoded to --prefix=ci/ [17:21:24] but [17:21:44] is NAME set? [17:21:56] oh, yes [17:21:58] yeah at the top [17:21:58] sorry [17:22:03] :} [17:22:12] I didn't want to overwrite / [17:22:24] the more eyes the better. I think I wrote that one sunday evening [17:22:47] I will have to move out after that one [17:22:53] why not changing the execution line? [17:23:03] so you can be 100% upstream [17:23:08] that is done later on in another patchset [17:23:13] which makes the default an erb template [17:23:23] I wanted to have small incremental changes [17:23:23] no, I mean what it calls this [17:23:31] ok ok [17:23:39] as long as you promise to do it [17:23:43] ultimately jenkins will end up being managed by systemd [17:24:01] and the default file content fully generated from hiera / jenkins::service::config or something like that [17:24:15] I am not sure where to head. But it seems to me hiera is easier to handle than some bash like script [17:24:26] but yeah, baby steps essentially :} [17:24:38] (03CR) 10Jcrespo: [C: 032] jenkins: sync default file with upstream 1.651.3 [puppet] - 10https://gerrit.wikimedia.org/r/337289 (owner: 10Hashar) [17:24:58] running puppet on contint2001 [17:25:12] wait [17:25:13] and disabled it on cont1001 [17:25:19] I am deploying still [17:26:38] restartedjenkins on contint2001 [17:27:23] looks god [17:27:25] everthing ok? [17:27:27] doing same on contint1001 [17:28:19] --webroot=/var/run/jenkins/war --httpPort=8080 --ajp13Port=-1 --prefix=/ci --accessLoggerClassName=winstone.accesslog.SimpleAccessLogger --simpleAccessLogger.format=combined --simpleAccessLogger.file=/var/log/jenkins/access.log [17:28:23] which is good :) [17:28:25] \O/ [17:28:57] thanks a ton ! [17:29:20] what is the strategy for the mount points [17:29:39] that one is empty [17:29:45] it is a leftover [17:29:54] but I gotta escape so we can skip it for now [17:30:02] 337014 admin: basic .vimrc for hashar [17:30:03] 332981 Gemfile: add xmlrpc for ruby 2.4 [17:30:08] are easy / no impact on prod [17:30:19] oh, I missread it [17:30:20] and I think that will be good enough for today swat :} [17:30:41] I compared it as adding /srv/ssd [17:30:52] it is all good [17:30:56] yeah it is from when we had ssd [17:31:03] (03PS3) 10Jcrespo: contint: remove /srv/ssd [puppet] - 10https://gerrit.wikimedia.org/r/337286 (owner: 10Hashar) [17:31:14] the last user was zuul-merger on scandium. 
But that got phased out :} [17:31:30] I can also deploy the user dir change, no problem [17:31:36] neat [17:32:25] (03CR) 10Jcrespo: [V: 032 C: 032] contint: remove /srv/ssd [puppet] - 10https://gerrit.wikimedia.org/r/337286 (owner: 10Hashar) [17:33:04] (03PS2) 10Jcrespo: admin: basic .vimrc for hashar [puppet] - 10https://gerrit.wikimedia.org/r/337014 (owner: 10Hashar) [17:33:15] (03CR) 10Jcrespo: [C: 032] admin: basic .vimrc for hashar [puppet] - 10https://gerrit.wikimedia.org/r/337014 (owner: 10Hashar) [17:33:43] (03CR) 10Jcrespo: [V: 032 C: 032] admin: basic .vimrc for hashar [puppet] - 10https://gerrit.wikimedia.org/r/337014 (owner: 10Hashar) [17:34:37] jynus: thx. I gotta run out now sorry :/ [17:34:43] bye! [17:36:50] (03PS5) 10Jcrespo: Gemfile: add xmlrpc for ruby 2.4 [puppet] - 10https://gerrit.wikimedia.org/r/332981 (owner: 10Hashar) [17:39:23] (03CR) 10Jcrespo: [V: 032 C: 032] Gemfile: add xmlrpc for ruby 2.4 [puppet] - 10https://gerrit.wikimedia.org/r/332981 (owner: 10Hashar) [17:43:00] puppet swat is done, but evil swatter rejected my CR :-( https://wikitech.wikimedia.org/w/index.php?title=Deployments&type=revision&diff=1530651&oldid=1530254 [17:46:09] Aww, mine didn't get carried over from the missed puppetswat [17:46:12] Next week I guess [17:46:49] RainbowSprinkles, which one? [17:47:31] https://gerrit.wikimedia.org/r/#/c/332707/ - updates a bunch of links in (mostly HTML pages of sorts) to use HTTPS instead of protocol-relative URLs [17:47:51] It's all internal-to-WMF links, so we know HTTPS exists and isn't going away [17:48:36] risk-wise I would be ok to deploy that [17:48:43] but I am not sure I agree with it [17:50:00] maybe if there was a better reason e.g. we want to hardcode https in case of X or something [17:50:25] or, it breaks X, Y and Z [17:51:09] Eh, not so much a reason other than being pedantic and consistent. [17:51:37] ie: If I were writing this file today, I wouldn't have used protocol-relative URLs [17:52:17] it's better to just use https: everywhere, so we're not relying on sts-preload to save us [17:52:36] bblack, if you are ok with it, I will deploy it [17:52:47] I just didn't see a strong reason to do it [17:52:49] the only caveat, and the reason we don't make a simple policy announcement of https:// -on-everything, is that some internal-only stuff still doesn't speak https [17:53:12] bblack: Indeed, this isn't touching anything like that though [17:53:24] This is all links to wikis or other known-https stuff [17:53:28] right [17:53:29] Links to meta, wmfwiki, etc [17:53:40] PROBLEM - puppet last run on elastic1023 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:54:17] ok, for me context like that would matter, being in the commit message would have eased me up [17:54:35] jynus: basically if you use an http:// link to get somewhere, there's a chance for a mitm to hijack the redirect to https and do Bad Things. Initial access via https is better. [17:54:57] so, in general we prefer to harcode https [17:55:05] HSTS and STS-preload are designed to minimize that risk (browser internally translates http to https because it knows we're on the https-only list, basically) [17:55:10] unless that is not availalbe, right? 
[17:55:26] which is not the case here [17:55:40] right, HSTS only works after their first (un-hacked) visit, and STS-preload isn't there in all possible user agents, just modern widespread browsers (FF, Chrome, IE11) [17:55:53] Assuming you aren't benefiting from HSTS/STS, you can avoid a MITM on the mixed content operations/puppet/modules/publichtml/templates/index.html.erb [17:56:02] ok, that makes sense [17:56:03] Which loads some images from upload.wm.o [17:56:15] let me have a quick look at all the domains changed [17:56:26] in case there is one odd [17:56:39] (03PS4) 10Jcrespo: Swap from protocol-relative urls to https everywhere [puppet] - 10https://gerrit.wikimedia.org/r/332707 (owner: 10Chad) [17:56:54] (03CR) 10Nemo bis: "I guess the commit message might as well claim to address T54253. :)" [puppet] - 10https://gerrit.wikimedia.org/r/332707 (owner: 10Chad) [17:57:55] FWIW, discovering protocol relative URLs was *awesome* in the transition period before we supported HTTPS-by-default-for-everyone :) [17:58:09] (I remember finding that in an RFC and being like WTF NO WAY THAT ROCKS) [17:58:09] ha ha [17:58:42] I have to go to a meeting [17:58:48] Nemo_bis: #til about T54253 [17:58:49] T54253: Protocol-relative URLs are poorly supported or unsupported by a number of HTTP clients - https://phabricator.wikimedia.org/T54253 [17:59:00] can you amend that, which is probably a good suggestion [17:59:10] and I will deploy in 30 minutes or so [17:59:18] only the commit message [17:59:33] I'll amend the commit message, yeah one min [18:00:04] gwicke, cscott, arlolra, subbu, halfak, and Amir1: Dear anthropoid, the time has come. Please deploy Services – Graphoid / Parsoid / OCG / Citoid / ORES (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170216T1800). [18:00:23] I've got a deployment of ORES. It should be easy. [18:00:40] (03PS5) 10Chad: Swap from protocol-relative urls to https everywhere [puppet] - 10https://gerrit.wikimedia.org/r/332707 (https://phabricator.wikimedia.org/T54253) [18:00:50] halfak: DO NOT JINX YOURSELF! ;) [18:01:08] Good point. I'm sure there'll be problems [18:01:08] "None of these URLs will ever go back to being non-https. Also, per the linked task, not all clients behave well with protocol-relative URLs, so avoiding them except when absolutely necessary is good for them." [18:01:10] first rule of deployments: nothing is easy :) [18:01:15] * halfak looks both ways -- shifty-eyed [18:02:14] jynus: Amended. I'll be around in ~30 when you're back. Thanks [18:03:41] !log deploying ores:e9bbda3 [18:03:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:03:48] !log halfak@tin Started deploy [ores/deploy@e9bbda3]: (no justification provided) [18:03:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:05:34] halfak: scap now auto-logs for all deploys ^^ [18:05:40] the start and end [18:05:59] Thanks greg-g. I'll remove that from our deploy script :) [18:06:12] Canary looks good. Moving forward [18:06:55] Oh. Looks like we're still restarting the service. [18:09:23] OK Confirmed canary moving forward now [18:15:34] halfak: Also, you can include a message in what you're doing and that's what'll be in IRC/SAL instead of (no justification provided) [18:15:43] `scap deploy "My awesome message is here"` [18:15:55] Gotcha. Will add that to the deploy script. 
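A quick usage sketch of the convention RainbowSprinkles describes just above; the checkout path is a guess and the task number is only a placeholder (the same one greg-g uses as an example a little further down):

    # On the deployment host, from the repo's scap checkout:
    cd /srv/deployment/ores/deploy
    scap deploy "Deploy ores e9bbda3 for T12345"
    # The start/finish messages land in the Server Admin Log automatically,
    # and a task ID in the message lets stashbot note the deploy on that task.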
[18:15:55] (03PS10) 10Muehlenhoff: Add account validation script / cron job [puppet] - 10https://gerrit.wikimedia.org/r/337032 (https://phabricator.wikimedia.org/T142836) [18:16:07] Do you think a phab task link is a good message? [18:16:16] That's a great message [18:16:26] OK will do. :) [18:16:27] Anything that provides context to someone reading later is a good message :) [18:16:38] Was thinking the phab task would be perfect for that :) [18:16:39] Phab tasks, gerrit changes [18:16:40] Etc [18:17:37] ¡log halfak@tin Started deploy [ores/deploy@e9bbda3]: T1234 [18:17:37] T1234: Restrict Bugzilla access to read-only - https://phabricator.wikimedia.org/T1234 [18:17:39] eg ^ [18:17:56] also, phab task, eg just doing "scap deploy "rollout for T12345" makes stashbot mention the deploy ont he task, a la https://phabricator.wikimedia.org/T155527#3029942 [18:17:56] T12345: Create "annotation" namespace on Hebrew Wikisource - https://phabricator.wikimedia.org/T12345 [18:17:58] Oh nice. It know phab task shape [18:18:13] Oh yeah, stashbot too [18:18:14] :D [18:20:00] PROBLEM - puppet last run on db1009 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:20:50] * halfak waits for "promote and restart" [18:20:53] 06Operations, 06Performance-Team, 05DC-Switchover-Prep-Q3-2016-17, 07Epic, 07Wikimedia-Multiple-active-datacenters: Prepare a reasonably performant warmup tool for MediaWiki caches (memcached/apc) - https://phabricator.wikimedia.org/T156922#3033929 (10Krinkle) Next steps: * [ ] Put node-warmup script in... [18:20:59] Deploy script is updated. [18:21:15] RainbowSprinkles, I don't suppose I could go edit past messages to associate the task, could I? [18:22:00] PROBLEM - puppet last run on db1062 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:22:18] (03CR) 10VolkerE: gerrit: Make blue buttons look like OOUI (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/337397 (https://phabricator.wikimedia.org/T158298) (owner: 10Ladsgroup) [18:22:20] halfak: You could edit the entry on the SAL on wikitech, but it wouldn't update the logstash store of it [18:22:40] RECOVERY - puppet last run on elastic1023 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures [18:22:41] (or if we still have the twitter bridge, it wouldn't edit that) [18:22:56] RainbowSprinkles, gotcha. Will leave it for now. [18:23:04] and make sure to do it next time [18:23:52] (03CR) 10Muehlenhoff: [V: 032 C: 032] Add account validation script / cron job (037 comments) [puppet] - 10https://gerrit.wikimedia.org/r/337032 (https://phabricator.wikimedia.org/T142836) (owner: 10Muehlenhoff) [18:27:10] PROBLEM - Work requests waiting in Zuul Gearman server https://grafana.wikimedia.org/dashboard/db/zuul-gearman on contint1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 0 below the confidence bounds [18:27:50] hashar ^^ [18:27:56] oh he's not online [18:28:23] anyways there dosent look like any tests running on https://integration.wikimedia.org/zuul/ (by that i mean nodepool dosent seem to be working) [18:29:12] ores deploy successful [18:29:16] \o/ [18:29:24] (03PS1) 10Muehlenhoff: Drop require_package for python-ldap [puppet] - 10https://gerrit.wikimedia.org/r/338150 [18:29:50] PROBLEM - puppet last run on terbium is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [18:30:28] (03CR) 10Muehlenhoff: [V: 032 C: 032] Drop require_package for python-ldap [puppet] - 10https://gerrit.wikimedia.org/r/338150 (owner: 10Muehlenhoff) [18:31:18] moritzm: what was the problem with the ldap dependency? [18:31:34] just out of curiosity given I suggested to add tehm [18:32:40] Duplicated declaration, see commit message [18:32:49] (03CR) 10Jcrespo: [C: 031] "They are all wiki sites, upload, tools and wikimedia portal." [puppet] - 10https://gerrit.wikimedia.org/r/332707 (https://phabricator.wikimedia.org/T54253) (owner: 10Chad) [18:32:56] (03CR) 10Chad: "So, the master cleanup bit needs two passes, as the localization cache files are owned by a different user. Sucks." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/336730 (https://phabricator.wikimedia.org/T73313) (owner: 10Chad) [18:33:10] RECOVERY - Work requests waiting in Zuul Gearman server https://grafana.wikimedia.org/dashboard/db/zuul-gearman on contint1001 is OK: OK: No anomaly detected [18:33:16] utils.pp could also be cleaned up to use require_package, but I rather wanted to resolve thd puppet failure quickly [18:33:30] PROBLEM - puppet last run on labservices1002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[/usr/local/bin/labs-ip-alias-dump.py] [18:33:49] (03PS6) 10Jcrespo: Swap from protocol-relative urls to https everywhere [puppet] - 10https://gerrit.wikimedia.org/r/332707 (https://phabricator.wikimedia.org/T54253) (owner: 10Chad) [18:33:50] RECOVERY - puppet last run on terbium is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures [18:35:08] sure, no problem, I though that require_package was ok with multiple declarations, and I guess that the error is due because utils.pp uses package() [18:35:18] yeah, that's the problem [18:35:31] :) [18:36:35] (03PS2) 10Jcrespo: Revert "db-eqiad.php: Depool db1082" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/337860 [18:36:47] (03Abandoned) 10Jcrespo: Revert "db-eqiad.php: Depool db1082" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/337860 (owner: 10Jcrespo) [18:38:48] (03CR) 10Dzahn: [C: 031] udp2log: mirror traffic via udpmirror.py [puppet] - 10https://gerrit.wikimedia.org/r/338119 (https://phabricator.wikimedia.org/T123728) (owner: 10Filippo Giunchedi) [18:44:38] (03CR) 10Dzahn: "10:42 - stop puppet" [puppet] - 10https://gerrit.wikimedia.org/r/338022 (owner: 10Paladox) [18:46:37] 06Operations, 06Performance-Team, 05DC-Switchover-Prep-Q3-2016-17, 07Epic, 07Wikimedia-Multiple-active-datacenters: Prepare a reasonably performant warmup tool for MediaWiki caches (memcached/apc) - https://phabricator.wikimedia.org/T156922#3034051 (10Volans) >>! In T156922#3033929, @Krinkle wrote: > *... [18:47:18] 06Operations, 06Operations-Software-Development, 07HHVM, 13Patch-For-Review: Upgrade all mw* servers to debian jessie - https://phabricator.wikimedia.org/T143536#2571031 (10Volans) What is the status of `terbium`? From the summary it appears to have been upgraded but the host is still a `trusty`. 
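The python-ldap hiccup moritzm and volans untangle above is a classic Puppet duplicate declaration: any number of require_package() calls for the same package coexist, but combining one with a plain package resource for that package does not, because require_package() ultimately declares the same Package resource. Roughly, with the manifests reduced to a hypothetical shape rather than quoted:

    # One manifest (utils.pp in this case) declares the package directly:
    package { 'python-ldap':
        ensure => present,
    }

    # Another manifest pulls it in via the repo's require_package() helper:
    require_package('python-ldap')
    # => compilation fails with a duplicate declaration of Package[python-ldap].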
[18:48:00] RECOVERY - puppet last run on db1009 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [18:48:30] RECOVERY - Disk space on labnet1001 is OK: DISK OK [18:50:00] RECOVERY - puppet last run on db1062 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [18:50:44] !log stop noodepool to reset state on pool [18:50:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:52:45] !log clean out nodepool instances [18:52:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:53:04] (03CR) 10Jcrespo: [V: 032 C: 032] Swap from protocol-relative urls to https everywhere [puppet] - 10https://gerrit.wikimedia.org/r/332707 (https://phabricator.wikimedia.org/T54253) (owner: 10Chad) [18:53:50] PROBLEM - nodepoold running on labnodepool1001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 113 (nodepool), regex args ^/usr/bin/python /usr/bin/nodepoold -d [18:54:50] RECOVERY - nodepoold running on labnodepool1001 is OK: PROCS OK: 1 process with UID = 113 (nodepool), regex args ^/usr/bin/python /usr/bin/nodepoold -d [19:00:04] addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Respected human, time to deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170216T1900). Please do the needful. [19:02:20] RECOVERY - puppet last run on labservices1002 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures [19:06:10] PROBLEM - puppet last run on cp1067 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:07:45] !log bump up nodepool allocated fixed ips set (I think it exhausted them errantly somehow?) [19:07:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:08:16] (03PS1) 10Volans: Add .gitreview file for Gerrit [software/cumin] - 10https://gerrit.wikimedia.org/r/338153 (https://phabricator.wikimedia.org/T154588) [19:09:37] (03CR) 10Chad: [C: 031] "Minor comment inline, but ok as-is." (031 comment) [software/cumin] - 10https://gerrit.wikimedia.org/r/338153 (https://phabricator.wikimedia.org/T154588) (owner: 10Volans) [19:10:39] 06Operations, 10Ops-Access-Requests, 06Research-and-Data, 13Patch-For-Review: Cluster Access for Nithum Thain - https://phabricator.wikimedia.org/T157724#3034161 (10Nithum) Hi Rob, could you change the ssh public key: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDHQ6oDkb1WXmbizF6PX4hIELg7azLCcAaNiIl2ytjKTv7Dcun... [19:11:07] RainbowSprinkles: oh nice! didn't know about it and yes I'm using at least another branch [19:11:16] New-ish feature :) [19:11:25] Lots of repos don't use it yet [19:11:35] so just use track instead of defaultbranch? [19:11:40] Yeah [19:11:51] so no need to change it in other branches [19:11:52] Benefit means when you make a new branch you don't need to update gitreview file [19:11:52] nice! 
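For anyone who has not met the option being discussed: with track enabled, git-review targets whatever remote branch the local branch tracks instead of a hard-coded defaultbranch, which is why new release branches need no .gitreview edit. Roughly what such a file could look like for this repo (the project path is assumed, not copied from the change):

    [gerrit]
    host=gerrit.wikimedia.org
    port=29418
    project=operations/software/cumin.git
    track=1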
[19:11:54] Yep [19:12:10] thanks for the review then, changing it immediately :D [19:12:10] PROBLEM - Work requests waiting in Zuul Gearman server https://grafana.wikimedia.org/dashboard/db/zuul-gearman on contint1001 is CRITICAL: CRITICAL: Anomaly detected: 18 data above and 0 below the confidence bounds [19:12:21] volans: We started using it on MW core + extensions so when we do our weekly branches we didn't have to do 100 dummy edits and commits [19:12:22] !log restarting kartotherian / tilerator on maps-test* [19:12:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:12:41] Saved tons of round-trips [19:13:09] thcipriani / RainbowSprinkles / thcipriani I'm ok to push the new scap version shortly btw, maybe after swat if that's still on [19:13:21] jouncebot: next [19:13:21] In 0 hour(s) and 46 minute(s): MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170216T2000) [19:13:21] (03PS2) 10Volans: Add .gitreview file for Gerrit [software/cumin] - 10https://gerrit.wikimedia.org/r/338153 (https://phabricator.wikimedia.org/T154588) [19:13:49] !log clean out /var/log/ on labnet1001 as it filled up [19:13:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:14:10] godog: Nothing was on swat for today [19:14:16] (03CR) 10Dzahn: [C: 04-1] "compiler says" [puppet] - 10https://gerrit.wikimedia.org/r/337388 (owner: 10Hashar) [19:14:24] We could go ahead now, train doesn't start for another 45m [19:14:57] (03CR) 10Dzahn: [C: 04-1] "compiler says: "Error: Must pass http_port to Class[Contint::Proxy_jenkins] at /mnt/jenkins-workspace/puppet-compiler/5495/change/src/modu" [puppet] - 10https://gerrit.wikimedia.org/r/337388 (owner: 10Hashar) [19:15:29] https://fat.gfycat.com/DefinitiveSomeKingbird.webm [19:15:54] awesome! :D [19:16:02] I want one of those [19:16:06] No real reason, just seems cool [19:16:23] that's how trains mate, I'm told [19:18:46] ok going ahead with reprepro and the puppet patch [19:19:16] something wrong in Zuul? seems there are a lot of waiting checks [19:19:20] godog: My favorite train meme is the photo on: https://wikitech.wikimedia.org/wiki/Heterogeneous_deployment/Train_deploys [19:19:26] volans: See #-releng [19:19:36] Nodepool is wonky, think it's slowly catching up [19:20:10] RainbowSprinkles: ok, thanks, I was not there [19:20:30] yw. Yeah, it's backed up but known. Hopefully unwinding its backlog now... [19:21:56] (03CR) 10Dzahn: [C: 031] Remove the templates dir, not needed anymore [puppet] - 10https://gerrit.wikimedia.org/r/337837 (owner: 10Jcrespo) [19:22:21] (03PS2) 10Filippo Giunchedi: scap: upgrade to 3.5.2-1 [puppet] - 10https://gerrit.wikimedia.org/r/338138 (https://phabricator.wikimedia.org/T127762) [19:23:30] (03CR) 10Chad: "Commit message nit, but otherwise ok" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/337837 (owner: 10Jcrespo) [19:23:50] (03CR) 10Filippo Giunchedi: [V: 032 C: 032] scap: upgrade to 3.5.2-1 [puppet] - 10https://gerrit.wikimedia.org/r/338138 (https://phabricator.wikimedia.org/T127762) (owner: 10Filippo Giunchedi) [19:24:08] (03CR) 10Chad: Remove the templates dir, not needed anymore (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/337837 (owner: 10Jcrespo) [19:25:07] 06Operations, 10Wikimedia-Logstash: Get 5xx logs into kibana/logstash - https://phabricator.wikimedia.org/T149451#3034223 (10Ottomata) It should! But I haven’t tried it. 
General options: -C | -P | -L Mode: Consume, Produce or metadata List -G Mode: High-level KafkaConsumer (Kafka 0.... [19:25:10] RECOVERY - Work requests waiting in Zuul Gearman server https://grafana.wikimedia.org/dashboard/db/zuul-gearman on contint1001 is OK: OK: No anomaly detected [19:25:27] ^:) [19:25:45] jynus: Oh, I didn't see you merge my change a bit ago re: https links. Thx! [19:32:18] (03PS1) 10Jgreen: rename backup4001 to frbackup4001 for clarity [dns] - 10https://gerrit.wikimedia.org/r/338156 [19:32:50] PROBLEM - Redis replication status tcp_6479 on rdb2005 is CRITICAL: CRITICAL: replication_delay is 627 600 - REDIS 2.8.17 on 10.192.32.133:6479 has 1 databases (db0) with 3719929 keys, up 108 days 11 hours - replication_delay is 627 [19:33:10] PROBLEM - Redis replication status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 648 600 - REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 3719921 keys, up 108 days 11 hours - replication_delay is 648 [19:33:23] (03CR) 10Jgreen: [C: 032] rename backup4001 to frbackup4001 for clarity [dns] - 10https://gerrit.wikimedia.org/r/338156 (owner: 10Jgreen) [19:35:10] RECOVERY - puppet last run on cp1067 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures [19:35:36] RainbowSprinkles: 3.5.2-1 is on tin btw [19:37:50] RECOVERY - Redis replication status tcp_6479 on rdb2005 is OK: OK: REDIS 2.8.17 on 10.192.32.133:6479 has 1 databases (db0) with 3695310 keys, up 108 days 11 hours - replication_delay is 0 [19:38:20] godog: Confirmed, lgtm [19:39:10] RECOVERY - Redis replication status tcp_6479 on rdb2006 is OK: OK: REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 3695104 keys, up 108 days 11 hours - replication_delay is 0 [20:00:04] thcipriani: Dear anthropoid, the time has come. Please deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170216T2000). [20:00:32] * thcipriani does. [20:04:59] 06Operations, 06Labs, 06Release-Engineering-Team: contintcloud project thinks it is using 206 fixed-ip quota errantly - https://phabricator.wikimedia.org/T158350#3034394 (10chasemp) [20:05:04] 06Operations, 06Labs, 06Release-Engineering-Team: contintcloud project thinks it is using 206 fixed-ip quota errantly - https://phabricator.wikimedia.org/T158350#3034406 (10chasemp) p:05Triage>03High [20:06:00] 06Operations, 06Labs, 06Release-Engineering-Team: contintcloud project thinks it is using 206 fixed-ip quota errantly - https://phabricator.wikimedia.org/T158350#3034394 (10chasemp) a:03Andrew currently nodepool is going along fine except the quota is clearly wrong. I don't yet understand why the current... 
[20:10:53] (03PS1) 10Thcipriani: group1 wikis to 1.29.0-wmf.12 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/338161 [20:10:55] (03CR) 10Thcipriani: [C: 032] group1 wikis to 1.29.0-wmf.12 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/338161 (owner: 10Thcipriani) [20:13:16] (03Merged) 10jenkins-bot: group1 wikis to 1.29.0-wmf.12 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/338161 (owner: 10Thcipriani) [20:13:45] !log thcipriani@tin rebuilt wikiversions.php and synchronized wikiversions files: group1 wikis to 1.29.0-wmf.12 [20:13:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:23:32] (03CR) 10jenkins-bot: group1 wikis to 1.29.0-wmf.12 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/338161 (owner: 10Thcipriani) [20:37:34] (03CR) 10Volans: [C: 032] Add .gitreview file for Gerrit (031 comment) [software/cumin] - 10https://gerrit.wikimedia.org/r/338153 (https://phabricator.wikimedia.org/T154588) (owner: 10Volans) [20:38:25] (03Draft1) 10Paladox: Phabricator: Fix systemd phd.service file [puppet] - 10https://gerrit.wikimedia.org/r/338163 [20:38:31] (03PS2) 10Paladox: Phabricator: Fix systemd phd.service file [puppet] - 10https://gerrit.wikimedia.org/r/338163 [20:39:02] (03CR) 10Dzahn: "can confirm this from tests done on labs instance" [puppet] - 10https://gerrit.wikimedia.org/r/338163 (owner: 10Paladox) [20:39:06] (03Merged) 10jenkins-bot: Add .gitreview file for Gerrit [software/cumin] - 10https://gerrit.wikimedia.org/r/338153 (https://phabricator.wikimedia.org/T154588) (owner: 10Volans) [20:39:37] (03CR) 10Dzahn: "i went to phab2001 to check the status there, and" [puppet] - 10https://gerrit.wikimedia.org/r/338163 (owner: 10Paladox) [20:40:07] (03CR) 10Paladox: "> i went to phab2001 to check the status there, and" [puppet] - 10https://gerrit.wikimedia.org/r/338163 (owner: 10Paladox) [20:40:42] (03CR) 10Gehel: [C: 04-1] "Some (mostly minor) comments, see inline." (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/333969 (https://phabricator.wikimedia.org/T155578) (owner: 10EBernhardson) [20:44:10] PROBLEM - Juniper alarms on asw-ulsfo.mgmt.ulsfo.wmnet is CRITICAL: JNX_ALARMS CRITICAL - 1 red alarms, 0 yellow alarms [20:45:10] PROBLEM - Host ripe-atlas-ulsfo is DOWN: PING CRITICAL - Packet loss = 100% [20:47:32] 06Operations, 10ops-ulsfo, 10netops: lvs4002 power supply failure - https://phabricator.wikimedia.org/T151273#3034504 (10RobH) Ok, I took a redundant supply from cp4007 and installed it into lvs4002 power supply 2 slot. Less than a minute later, the system killed the new power supply. Record: 1022 Da... [20:47:39] joFeb 16 20:38:11 phab2001 systemd[1]: [/etc/systemd/system/phd.service:5] Unknown lvalue 'User' in section 'Unit' [20:48:03] (03PS3) 10Paladox: Phabricator: Fix systemd phd.service file [puppet] - 10https://gerrit.wikimedia.org/r/338163 [20:48:12] i didn't mean to paste that, but yes ^ [20:48:28] that is on phab2001 . iridium is ok, not jessie [20:49:10] 06Operations, 10ops-eqiad, 06Services (watching): Degraded RAID on restbase-dev1001 - https://phabricator.wikimedia.org/T157425#3034511 (10Eevans) Is there an ETA on this? We have some testing as a part of T156199 that could benefit from this environment; Having some idea would help with planning these tasks. 
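Two distinct problems orbit this phd.service change: the unit carries User= in the wrong section (phab2001's journal shows "Unknown lvalue 'User' in section 'Unit'", quoted just below), and on reboot phd comes up before MySQL does. A hedged sketch of the shape a fixed unit could take; the user, paths and ordering targets here are assumptions, not the file puppet actually ships:

    # Illustrative sketch only.
    [Unit]
    Description=Phabricator daemons (phd)
    # After= only orders startup; add Wants=mysql.service if phd should
    # also pull MySQL in rather than merely wait for it when both start.
    After=network.target mysql.service

    [Service]
    # User= belongs here -- under [Unit] it produces the "Unknown lvalue" error.
    User=phd
    Type=forking
    ExecStart=/srv/phab/phabricator/bin/phd start
    ExecStop=/srv/phab/phabricator/bin/phd stop

    [Install]
    WantedBy=multi-user.target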
[20:52:22] (03PS4) 10Paladox: Phabricator: Fix systemd phd.service file [puppet] - 10https://gerrit.wikimedia.org/r/338163 [21:06:08] (03PS1) 10Thcipriani: all wikis to 1.29.0-wmf.12 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/338168 [21:06:10] (03CR) 10Thcipriani: [C: 032] all wikis to 1.29.0-wmf.12 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/338168 (owner: 10Thcipriani) [21:07:27] (03Merged) 10jenkins-bot: all wikis to 1.29.0-wmf.12 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/338168 (owner: 10Thcipriani) [21:07:36] (03CR) 10jenkins-bot: all wikis to 1.29.0-wmf.12 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/338168 (owner: 10Thcipriani) [21:08:02] !log thcipriani@tin rebuilt wikiversions.php and synchronized wikiversions files: all wikis to 1.29.0-wmf.12 [21:08:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:18:23] anybody knows whether mw* hosts are time-synced? I get edits on test.wikidata.org which are 20 secs in the past [21:18:56] SMalyshev: they _should_ be time synced [21:19:56] ok maybe my local clock is broken then... [21:30:10] (03CR) 1020after4: [C: 031] Phabricator: Fix systemd phd.service file [puppet] - 10https://gerrit.wikimedia.org/r/338163 (owner: 10Paladox) [21:39:52] !log Deleted around 9500 pre 2013 captchas [21:39:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:39:58] !log make that 2017 [21:40:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:43:56] (03PS1) 10MaxSem: Tabular data license CC0-1.0+ -> CC0-1.0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/338208 (https://phabricator.wikimedia.org/T154075) [21:51:26] (note: this is just a joke i know this is a serious channel but i think 1 message wont hurt) Reedy, time traveling while doing tasks is discouraged and can cause confusion please avoid time traveling until your are done with the task at hand :P [21:51:54] Zppix: The funny thing is the captchas were actually from 2014 [21:51:55] *2013 [21:51:56] ffs [21:52:03] Hence my slip up [21:52:35] (03CR) 10Dzahn: "much better, but on reboot it fails because mysql is not started first" [puppet] - 10https://gerrit.wikimedia.org/r/338163 (owner: 10Paladox) [21:53:47] Reedy i figured as much, but until then time traveling privelges are suspended [21:57:19] (03PS7) 10Dzahn: jenkins: support variable prefix setting [puppet] - 10https://gerrit.wikimedia.org/r/337307 (owner: 10Hashar) [21:58:49] wowwwww [21:58:58] fatalmonitor looks boringly clean! [21:59:48] MaxSem want me to fix that for you? [22:00:04] MaxSem and jgirault: Respected human, time to deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170216T2200). Please do the needful. [22:00:38] I'mv gonna try to fix that right now! [22:00:51] (03PS5) 10Paladox: Phabricator: Fix systemd phd.service file [puppet] - 10https://gerrit.wikimedia.org/r/338163 [22:04:01] !log phab2001 - start/stop phd, testing gerrit 338163 [22:04:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:04:51] (03CR) 10Dzahn: [C: 032] "works on phab2001 and it was tested on labs that services come back after reboot now, thank you for this fix" [puppet] - 10https://gerrit.wikimedia.org/r/338163 (owner: 10Paladox) [22:08:01] (03CR) 10Dzahn: ". correction.. 
still an issue on reboot, needs follow-up fix, but this was not wrong, it was needed too" [puppet] - 10https://gerrit.wikimedia.org/r/338163 (owner: 10Paladox) [22:09:29] (03PS4) 10Dzahn: adjust wikimania regex for mobile hosts, cover 2002-2099 [puppet] - 10https://gerrit.wikimedia.org/r/337893 (https://phabricator.wikimedia.org/T152882) [22:11:01] (03CR) 10Dzahn: [C: 032] "http://puppet-compiler.wmflabs.org/5497/contint1001.wikimedia.org/" [puppet] - 10https://gerrit.wikimedia.org/r/337307 (owner: 10Hashar) [22:11:19] (03PS8) 10Dzahn: jenkins: support variable prefix setting [puppet] - 10https://gerrit.wikimedia.org/r/337307 (owner: 10Hashar) [22:16:45] (03PS3) 10Smalyshev: [DNM] [WIP] Allow SPARQL endpoint to be queries via POST [puppet] - 10https://gerrit.wikimedia.org/r/243883 (https://phabricator.wikimedia.org/T112151) [22:17:58] (03PS4) 10Smalyshev: [DNM] [WIP] Allow SPARQL endpoint to be queries via POST [puppet] - 10https://gerrit.wikimedia.org/r/243883 (https://phabricator.wikimedia.org/T112151) [22:18:05] dear Zuul, how much sacrifice do you need? [22:18:23] (03CR) 10Smalyshev: [C: 04-1] "Not to be deployed until Blazegraph patch for X-BIGDATA-READ-ONLY support is merged." [puppet] - 10https://gerrit.wikimedia.org/r/243883 (https://phabricator.wikimedia.org/T112151) (owner: 10Smalyshev) [22:18:37] (03CR) 10Smalyshev: [C: 04-1] [DNM] [WIP] Allow SPARQL endpoint to be queries via POST [puppet] - 10https://gerrit.wikimedia.org/r/243883 (https://phabricator.wikimedia.org/T112151) (owner: 10Smalyshev) [22:19:42] mutante: if you are around? I am willing to remove the Zuul gearman icinga probe. It is useless after all [22:20:00] mutante: I should just use a threshold instead of the anomaly detector as you suggested yesterday [22:20:03] MaxSem here have my lamb see if the zuul gods will be appeased [22:20:07] hashar: i am around, and i am testing the latest merge on contint2001 [22:20:14] while i stopped puppet on 1001 for a moment [22:20:20] ahh [22:20:25] IT WORKED [22:20:26] yeah that was my morning hack [22:20:30] MaxSem your welcome [22:20:48] I probably should document it a bit more. But the idea is to have a minimum threshold for the anomaly detection [22:20:58] hashar: there is a problem with https://gerrit.wikimedia.org/r/#/c/337307/8/modules/contint/templates/apache/proxy_jenkins.erb [22:21:07] there are no new lines [22:21:12] 7 ProxyPass /ci http://localhost:8080/ciProxyPassReverse /ci http://localhost:8080/ciProxyRequests Of f [22:21:37] it ends up on a single line in /etc/apache2/jenkins_proxy [22:23:00] mutante i had that same problem [22:23:06] (03CR) 10Dzahn: "somehow there are missing new lines in the resulting /etc/apache2/jenkins_proxy" [puppet] - 10https://gerrit.wikimedia.org/r/337307 (owner: 10Hashar) [22:23:14] remove the - - lines from <%= @prefix -%> [22:23:20] hashar ^^ [22:23:33] :( [22:23:56] hashar should be fixable by removing the - <%= @prefix -%> -> <%= @prefix %> [22:23:58] (03CR) 10MaxSem: [C: 032] Tabular data license CC0-1.0+ -> CC0-1.0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/338208 (https://phabricator.wikimedia.org/T154075) (owner: 10MaxSem) [22:24:42] why do I always get those .erb things wrong :( [22:24:47] i had that problem on the logstash change for gerrit. [22:25:16] hashar: re: icinga check. 
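The template problem being debugged here comes down to ERB's trim syntax: a closing tag written as -%> strips the newline that follows it, which is why the rendered /etc/apache2/jenkins_proxy came out as one long line. A rough sketch of the two behaviours, using made-up directive lines rather than the real proxy_jenkins.erb:

    # Trimming closer at end of line eats the following newline, so the
    # directives run together:
    ProxyPass <%= @prefix %> http://localhost:8080<%= @prefix -%>
    ProxyPassReverse <%= @prefix %> http://localhost:8080<%= @prefix -%>
    ProxyRequests Off
    # renders roughly as:
    #   ProxyPass /ci http://localhost:8080/ciProxyPassReverse /ci http://localhost:8080/ciProxyRequests Off

    # Plain closer keeps the line breaks:
    ProxyPass <%= @prefix %> http://localhost:8080<%= @prefix %>
    ProxyPassReverse <%= @prefix %> http://localhost:8080<%= @prefix %>
    ProxyRequests Off

    # An expression in the middle of a line (the trailing "*>" case) needs
    # no change, since no newline follows the tag there:
    <Proxy http://localhost:8080<%= @prefix %>*>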
all up to you, we can remove it or ACK it a little longer if we see a chance to fix it later [22:26:55] (03Merged) 10jenkins-bot: Tabular data license CC0-1.0+ -> CC0-1.0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/338208 (https://phabricator.wikimedia.org/T154075) (owner: 10MaxSem) [22:27:11] (03CR) 10jenkins-bot: Tabular data license CC0-1.0+ -> CC0-1.0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/338208 (https://phabricator.wikimedia.org/T154075) (owner: 10MaxSem) [22:27:55] (03PS1) 10Hashar: contint: keeping trailing new line in proxy_jenkins [puppet] - 10https://gerrit.wikimedia.org/r/338274 [22:28:04] paladox: mutante: https://gerrit.wikimedia.org/r/338274 should keep the newlines [22:28:51] mutante: I will just drop the icinga check. Preparing a patch for that. If you want the details https://phabricator.wikimedia.org/T70113#3034630 [22:29:06] the anomaly band closely follow the raising metrics, and thus there is no anomaly :] [22:29:35] ah! yea [22:29:47] (03CR) 10Paladox: contint: keeping trailing new line in proxy_jenkins (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/338274 (owner: 10Hashar) [22:29:49] thanks [22:29:51] so then let's try with a fixed "max number of jobs" [22:29:52] I would need a better way, most probably just a threshold [22:30:03] I asked releng team for some feedback about it [22:30:15] so I guess we will come back with a better patch :] [22:30:38] ok! [22:30:59] but maybe that is just due to the holtWintersConfidenceBand being hard set to a delta=5 [22:31:21] alright, let's do the template fix for now [22:31:25] so I will have to put a bit more thoughts in it [22:31:26] looks like there are some more lines with that [22:31:33] !log maxsem@tin Synchronized wmf-config/CommonSettings.php: https://gerrit.wikimedia.org/r/#/c/338208/ (duration: 00m 53s) [22:31:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:31:46] ahr [22:31:50] paladox found some more cases [22:32:00] Yep [22:32:09] ah no [22:32:13] but some of them are NOT supposed to be new lines [22:32:14] @paladox [22:32:15] because they are inside a line [22:32:20] Yep [22:32:21] so we actually dont want newlines in the others [22:32:25] line 14 is right though [22:32:36] !log maxsem@tin Synchronized php-1.29.0-wmf.12/extensions/JsonConfig/: https://gerrit.wikimedia.org/r/#/c/338013/ (duration: 00m 42s) [22:32:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:32:48] *> [22:32:54] there is trailing *> [22:32:55] line 14 should be changed, line 40-44 should stay [22:33:15] (03PS5) 10Zppix: adjust wikimania regex for mobile hosts, cover 2002-2099 [puppet] - 10https://gerrit.wikimedia.org/r/337893 (https://phabricator.wikimedia.org/T152882) (owner: 10Dzahn) [22:33:18] used to be [22:33:32] oh! [22:33:32] so we need to keep the *> on the same line dont we? [22:33:39] !log maxsem@tin Started scap: Update messages for https://gerrit.wikimedia.org/r/#/c/338013/ [22:33:39] you are right, yes [22:33:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:33:53] eventually I will have some rspec tests and will probably add some for the template [22:34:02] (03CR) 10Dzahn: [C: 032] contint: keeping trailing new line in proxy_jenkins [puppet] - 10https://gerrit.wikimedia.org/r/338274 (owner: 10Hashar) [22:34:12] paladox: thanks for the hint! [22:34:19] Your welcome :) [22:34:37] for the icinga alarm, you are right lets ack it for a week ? 
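For reference, the "schedule downtime" route that gets picked a little further down is normally driven through Icinga's external command pipe; SCHEDULE_SVC_DOWNTIME is a standard Nagios/Icinga 1.x command. The host name, service name and pipe path below are placeholders, a sketch of the mechanism rather than the exact invocation used:

    # placeholder names and paths; illustrates the command format only
    now=$(date +%s)
    end=$(date -d '+7 days' +%s)
    printf '[%s] SCHEDULE_SVC_DOWNTIME;contint1001;zuul-gearman;%s;%s;1;0;%s;dzahn;revisit anomaly check next week\n' \
        "$now" "$now" "$end" "$((end - now))" > /var/lib/icinga/rw/icinga.cmd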
[22:34:50] making sure it stay acknowledged on recovery [22:35:00] PROBLEM - Disk space on tin is CRITICAL: DISK CRITICAL - free space: / 582 MB (1% inode=78%) [22:35:05] will revisit it and come with a proper fix for the anomaly check next week [22:35:12] ok, so the remaining diff is: [22:35:13] -PREFIX=/$NAME [22:35:13] +PREFIX=/ci [22:35:21] -JENKINS_ARGS="--webroot=/var/run/jenkins/war --httpPort=$HTTP_PORT --ajp13Port=$AJP_PORT --prefix=/ci $JENKINS_ACCESSLOG_ENABLE" [22:35:24] +JENKINS_ARGS="--webroot=/var/run/jenkins/war --httpPort=$HTTP_PORT --ajp13Port=$AJP_PORT --prefix=$PREFIX $JENKINS_ACCESSLOG_ENABLE" [22:35:27] looks good to me [22:35:30] yeah [22:35:43] now i am enabling puppet on contin1001 [22:35:45] to apply it there [22:35:46] that was a source of confusion. Tripped on it earlier today during the puppet swat [22:36:04] surely setting PREFIX to a wrong value was confusing, but that is just because before PREFIX was not used [22:36:08] \o/ [22:36:26] ok, it's done [22:36:27] I dont know whether puppet reload apache, i think it does [22:36:32] anyway that is a noop for ci itself [22:36:40] hmm. it did not [22:36:43] only can break the https://integration.wikimedia.org/ [22:37:08] i restarted apache to be sure [22:37:08] and the various sub pages the proxy to jenkins https://integration.wikimedia.org/ci/ or the proxy to zuul https://integration.wikimedia.org/zuul/status.json [22:37:10] done [22:37:29] looks fine :) [22:37:42] great [22:37:57] for the context all that serie of patches has two goals: [22:38:05] hook jenkins behind systemd [22:38:10] PROBLEM - Check systemd state on restbase2010 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [22:38:10] PROBLEM - Check systemd state on restbase2009 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [22:38:10] PROBLEM - cassandra-a CQL 10.192.16.186:9042 on restbase2010 is CRITICAL: connect to address 10.192.16.186 and port 9042: Connection refused [22:38:11] PROBLEM - cassandra-a SSL 10.192.16.186:7001 on restbase2010 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [22:38:11] PROBLEM - cassandra-c service on restbase2009 is CRITICAL: CRITICAL - Expecting active but unit cassandra-c is failed [22:38:20] PROBLEM - cassandra-a service on restbase2010 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed [22:38:21] and eventually let us have multiple jenkins instances on a single host [22:38:28] "making sure it stays acknowledeged after recovery" does not work [22:38:34] ACK is always "until next status change" [22:38:38] ah [22:38:40] PROBLEM - cassandra-c CQL 10.192.48.56:9042 on restbase2009 is CRITICAL: connect to address 10.192.48.56 and port 9042: Connection refused [22:38:40] PROBLEM - cassandra-c SSL 10.192.48.56:7001 on restbase2009 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [22:38:43] so I guess lets disable it [22:38:43] but we can "schedule downtime" [22:38:47] with a similar effect [22:38:47] oh [22:39:03] i will do that now.. downtime until next week [22:39:03] guess we can consider it down for a week so :] [22:39:54] (03CR) 10Ladsgroup: gerrit: Make blue buttons look like OOUI (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/337397 (https://phabricator.wikimedia.org/T158298) (owner: 10Ladsgroup) [22:40:21] This service has been scheduled for fixed downtime from 2017-02-16 22:39:31 to 2017-02-23 00:39:31. 
Notifications for the service will not be sent out during that time period. [22:41:40] awesome. and sorry for all the trouble [22:41:50] I should have been more careful and actually play test the command on my local machine [22:42:03] no problem at all [22:42:05] I did that this morning, even retrievied the raw metrics from statsd, played with them all locally [22:42:15] more work = more things to break [22:42:15] I think I was expecting things to work all magically [22:42:53] one step closer to multiple jenkins on one host :) [22:42:59] yeah hopefully [22:43:19] but I am distracting you, you might want to look at the restbase alarms above [22:43:25] paladox: gerrit and button color sounds like something for you :) [22:43:52] mutante yep, i've applied it on gerrit-test3. but i doint notice a different [22:43:57] difference [22:45:05] 06Operations, 10Deployment-Systems, 06Release-Engineering-Team: tin.eqiad.wmnet / partition is full - https://phabricator.wikimedia.org/T158358#3034831 (10hashar) [22:45:29] (03CR) 10VolkerE: "In general I'd recommend not start aligning this tool with WMUI style guide, as it would go far beyond colors, if we want to do it right™." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/337397 (https://phabricator.wikimedia.org/T158298) (owner: 10Ladsgroup) [22:46:00] RECOVERY - Disk space on tin is OK: DISK OK [22:46:04] !log tin - apt-get clean - 4.6G avail (T158359) [22:46:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:46:56] 06Operations, 10Deployment-Systems, 06Release-Engineering-Team: tin.eqiad.wmnet / partition is full - https://phabricator.wikimedia.org/T158358#3034851 (10hashar) ``` $ du -h /var/lib/l10nupdate/caches/ 1.5G /var/lib/l10nupdate/caches/cache-1.29.0-wmf.2 1.5G /var/lib/l10nupdate/caches/cache-1.29.0-wmf.... [22:47:05] oh nice [22:47:29] RainbowSprinkles: twentyafterfour: Reedy: arent we supposed to clean the old l10nupdate caches ? [22:47:52] hashar: yes [22:47:52] probably [22:47:58] Doesn't scap do it? [22:48:00] there was once a cron to do that [22:48:05] the one from 1.29.0-wmf.1 is from November 10th [22:48:08] and it is still on tin :( [22:48:11] it's been this multiple times :p [22:48:16] https://phabricator.wikimedia.org/T158358#3034851 [22:48:17] (03CR) 10Paladox: "Actually it works on reboot now. Just that i had puppet disabled. After enabling it, rebooting works :)" [puppet] - 10https://gerrit.wikimedia.org/r/338163 (owner: 10Paladox) [22:48:19] i thought we have a cron that deletes it [22:48:20] Reedy: no, /var/lib/l10nupdate is the l10nupdate cron job [22:48:25] because it happened before [22:48:54] Scap should just delete those when it deletes the /srv/mediawiki-staging branches [22:49:17] scap shouldn't have to know about l10nupdate unless we fold all that crap into scap [22:49:22] (03CR) 10Dzahn: "because... /var/run/phd/pid was owned by root, and once that was deleted and puppet re-created it, and "phd" user could own the pid file.." [puppet] - 10https://gerrit.wikimedia.org/r/338163 (owner: 10Paladox) [22:49:28] we really should just stop doing l10nupdate [22:49:32] lol [22:49:40] Can't we do it as as a scap plugin that hooks into that function? [22:49:48] i expected it's a cron that runs find .. and deletes older than X [22:49:53] its usefulness is pretty low with the weekly branch cadence [22:50:09] Was there a task created about it yet? 
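The "cron that runs find and deletes older than X" idea floated here would look roughly like the sketch below. The path comes from the du output earlier in the log; the 30-day window, and keying on age at all rather than on which branches are still listed in wikiversions, are assumptions, not an existing job:

    # hypothetical cleanup sketch, not an existing cron entry
    find /var/lib/l10nupdate/caches -maxdepth 1 -type d -name 'cache-1.*' -mtime +30 \
        -exec rm -rf {} +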
[22:50:51] T130317 [22:50:51] T130317: setup automatic deletion of old l10nupdate - https://phabricator.wikimedia.org/T130317 [22:50:56] 06Operations, 10Deployment-Systems, 06Release-Engineering-Team: tin.eqiad.wmnet / partition is full - https://phabricator.wikimedia.org/T158358#3034858 (10hashar) From IRC supposedly we had a cron job to garbage collect the old caches. ``` $ sudo -u l10nupdate -s crontab -l 0 2 * * * /usr/local/bin/l10nupdat... [22:51:00] and T133913 [22:51:00] T133913: Completely port l10nupdate to scap - https://phabricator.wikimedia.org/T133913 [22:51:05] scap plugin WIP for cleaning up old branches [22:51:07] `scap clean` [22:51:13] New features up for review [22:51:16] and T119747 [22:51:17] T119747: deleteMediaWiki should delete /var/lib/l10nupdate/caches/cache-$wmgVersionNumber - https://phabricator.wikimedia.org/T119747 [22:51:20] It's on the train docs [22:51:23] I was meaning for turning it off? [22:51:24] Oh, those caches [22:51:30] Bleh, I can add to scap clean [22:51:32] bd808: mind copy pasting those tasks to https://phabricator.wikimedia.org/T158358 ? :] [22:51:34] * RainbowSprinkles makes note [22:51:54] https://gerrit.wikimedia.org/r/#/c/336730/ https://gerrit.wikimedia.org/r/#/c/336901/ [22:52:01] ^ reviews welcome kthnxbai [22:52:10] 06Operations, 10Deployment-Systems, 06Release-Engineering-Team: tin.eqiad.wmnet / partition is full - https://phabricator.wikimedia.org/T158358#3034831 (10Dzahn) also see: T130317, T133913, T119747 [22:52:20] T119747 is outdated [22:52:21] :] [22:52:26] deleteMediaWiki was dumb so I killed it [22:52:28] 06Operations, 10Deployment-Systems, 06Release-Engineering-Team: tin.eqiad.wmnet / partition is full - https://phabricator.wikimedia.org/T158358#3034871 (10bd808) [22:52:34] T119747 should be about scap clean now [22:52:39] heh [22:52:40] PROBLEM - puppet last run on mw1218 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [22:52:47] 06Operations, 10Deployment-Systems, 06Release-Engineering-Team: tin.eqiad.wmnet / partition is full - https://phabricator.wikimedia.org/T158358#3034872 (10Dzahn) I ran "apt-get clean" on tin which freed another 2G or so [22:52:52] But yes, see ./scap/plugins/clean.py in mediawiki-config [22:52:54] neat [22:52:54] hrm, scap clean for 1.29.0-wmf.5 had some...errors [22:52:55] If you want to contribute [22:52:58] Or review those patches [22:53:09] thcipriani: Yes, I know [22:53:10] so I am not touching anything since it is close to midnight here [22:53:11] See final comment on https://gerrit.wikimedia.org/r/#/c/336730/ [22:53:12] I dont want to explode something [22:53:18] Err, wait, that hasn't landed [22:53:20] What errors? [22:53:23] *angry face* [22:54:03] perm errors for masters for deleting l10n dirs [22:54:14] then some other random ones [22:54:17] * thcipriani makes a paste [22:54:29] Ah yes [22:54:30] Ok [22:54:32] Known [22:54:43] (I hate that permission discrepancy) [22:54:54] Same problem as on that comment I made in the gerrit change [22:55:05] deploy masters /srv/mediawiki-staging/ need 2 passes [22:55:10] One for l10n, one for everything else [22:55:25] (again, screw that discrepancy) [22:55:52] I like the idea of removing l10nupdate [22:55:56] +10000 [22:55:58] write an rfc [22:56:00] PROBLEM - puppet last run on mc1030 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [22:56:00] 06Operations, 10Deployment-Systems, 06Release-Engineering-Team: tin.eqiad.wmnet / partition is full - https://phabricator.wikimedia.org/T158358#3034874 (10Dzahn) [22:56:03] now that we deploy once per week, it is probably less useful than it used to be [22:56:06] and probably [22:56:11] Indeed [22:56:14] Plus it's always broken [22:56:16] There's race conditions [22:56:25] did you delete stuff already? [22:56:26] Easy to overwrite to prior non-auto msgs [22:56:27] we could get the l10n bot to refresh translation once per week instead of on a daily basis accross a thousand of repos [22:56:28] 21G free [22:56:33] https://phabricator.wikimedia.org/P4942 [22:56:33] 06Operations, 10Deployment-Systems, 06Release-Engineering-Team: tin.eqiad.wmnet / partition is full - https://phabricator.wikimedia.org/T158358#3034831 (10Reedy) I killed the 1.28 l10nupdate cache folders, and the 1.29 ones < .10 [22:56:38] hashar: It refreshes anyway with train [22:56:41] ah [22:56:45] Then subsequent scaps overwrite the messages [22:56:45] also a bunch of stuff in the .git directory [22:56:49] until l10nupdate comes along [22:56:53] And reverts [22:56:57] yeah, once a week is pointless with the train [22:56:58] Again and again and again they fight [22:57:04] So awesome. [22:57:04] Manual scaps v l10nupdate [22:57:16] 06Operations, 10Deployment-Systems, 06Release-Engineering-Team: tin.eqiad.wmnet / partition is full - https://phabricator.wikimedia.org/T158358#3034878 (10Dzahn) Makes me think how mira is doing. [22:57:20] RainbowSprinkles: That's gotta be the best reason to just disable it [22:58:08] !log maxsem@tin Finished scap: Update messages for https://gerrit.wikimedia.org/r/#/c/338013/ (duration: 24m 29s) [22:58:09] Reedy: I've been meaning to write an RfC on it but haven't had the spare time [22:58:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:01:00] PROBLEM - puppet last run on francium is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [23:02:58] !log maxsem@tin Started scap: Another time, just ot make sure some files synched cuz lat time there were some mid-air collisions [23:03:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:03:10] RECOVERY - Check systemd state on restbase2009 is OK: OK - running: The system is fully operational [23:03:10] RECOVERY - cassandra-c service on restbase2009 is OK: OK - cassandra-c is active [23:04:40] RECOVERY - cassandra-c CQL 10.192.48.56:9042 on restbase2009 is OK: TCP OK - 0.036 second response time on 10.192.48.56 port 9042 [23:04:41] RECOVERY - cassandra-c SSL 10.192.48.56:7001 on restbase2009 is OK: SSL OK - Certificate restbase2009-c valid until 2017-09-12 15:36:12 +0000 (expires in 207 days) [23:05:10] RECOVERY - Check systemd state on restbase2010 is OK: OK - running: The system is fully operational [23:05:20] RECOVERY - cassandra-a service on restbase2010 is OK: OK - cassandra-a is active [23:06:10] RECOVERY - cassandra-a CQL 10.192.16.186:9042 on restbase2010 is OK: TCP OK - 0.036 second response time on 10.192.16.186 port 9042 [23:06:10] RECOVERY - cassandra-a SSL 10.192.16.186:7001 on restbase2010 is OK: SSL OK - Certificate restbase2010-a valid until 2017-11-17 00:54:24 +0000 (expires in 273 days) [23:08:36] (03PS6) 10Hashar: jenkins: allow changing the web service TCP port [puppet] - 10https://gerrit.wikimedia.org/r/337388 [23:08:57] (03CR) 10jerkins-bot: [V: 04-1] jenkins: allow changing the web service TCP port [puppet] - 10https://gerrit.wikimedia.org/r/337388 (owner: 10Hashar) [23:09:14] (03CR) 10Hashar: "I passed http_port to the 'jenkins' class to have the daemon listen on that port. But I forgot to pass http_port for the Apache proxy par" [puppet] - 10https://gerrit.wikimedia.org/r/337388 (owner: 10Hashar) [23:09:36] (03PS5) 10Hashar: jenkins: support umask via service default [puppet] - 10https://gerrit.wikimedia.org/r/337377 [23:14:32] (03CR) 10Hashar: "https://puppet-compiler.wmflabs.org/5498/ pass :]" [puppet] - 10https://gerrit.wikimedia.org/r/337377 (owner: 10Hashar) [23:15:27] (03PS1) 10ArielGlenn: write results from getlastpageid and getlastrevid to stdout, not stderr [dumps/mwbzutils] - 10https://gerrit.wikimedia.org/r/338280 [23:15:29] (03PS1) 10ArielGlenn: update .gitignore with the binaries for the new utilities [dumps/mwbzutils] - 10https://gerrit.wikimedia.org/r/338281 [23:15:31] (03PS1) 10ArielGlenn: script to check whether page range of bz2 checkpoint file is correct [dumps/mwbzutils] - 10https://gerrit.wikimedia.org/r/338282 [23:15:38] yeah, I was hoarding, sorry [23:17:18] (03CR) 10Dzahn: [C: 032] adjust wikimania regex for mobile hosts, cover 2002-2099 [puppet] - 10https://gerrit.wikimedia.org/r/337893 (https://phabricator.wikimedia.org/T152882) (owner: 10Dzahn) [23:17:26] (03PS6) 10Dzahn: adjust wikimania regex for mobile hosts, cover 2002-2099 [puppet] - 10https://gerrit.wikimedia.org/r/337893 (https://phabricator.wikimedia.org/T152882) [23:18:43] !log maxsem@tin Finished scap: Another time, just ot make sure some files synched cuz lat time there were some mid-air collisions (duration: 15m 44s) [23:18:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:20:10] PROBLEM - puppet last run on ms-be3001 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [23:20:40] RECOVERY - puppet last run on mw1218 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures [23:21:50] PROBLEM - puppet last run on db1076 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [23:21:58] hashar: alright, we can do the umask change if you are still here to confirm it :) [23:22:08] yeah [23:22:15] im getting a ton of errors on the labs phabricator instance [23:22:18] with stuff like [23:22:19] Feb 16 23:16:46 phabricator systemd[1]: [/lib/systemd/system/keyholder-proxy.service:12] Unknown lvalue 'ExecPre' in section 'Service' [23:22:24] (03PS6) 10Dzahn: jenkins: support umask via service default [puppet] - 10https://gerrit.wikimedia.org/r/337377 (owner: 10Hashar) [23:22:44] Feb 16 23:18:28 phabricator nslcd[2298]: [adea3d] ldap_start_tls_s() failed (uri=ldap://ldap-labs.eqiad.wikimedia.org:389): Can't contact LDAP server: Connection timed out [23:22:45] and ^^ [23:22:46] mutante ^^ [23:22:47] paladox: ? _after_ all the testing you just did? [23:22:57] oh, LDAP server [23:22:58] not realted to any of the stuff we did [23:22:58] mutante: though I would rather not restart JEnkins now to double check :] [23:24:00] RECOVERY - puppet last run on mc1030 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures [23:24:04] hashar: better another time? [23:24:12] I mean [23:24:15] the change can land [23:24:33] just have to verify that sourcing /etc/default/jenkins has UMASK properly set [23:24:48] then the init.d will catch it [23:24:51] and pass --umask [23:24:58] can verify tomorrow [23:25:00] paladox: is this on all instances or just one? [23:25:11] mutante only seen it on one [23:25:15] * paladox checks the others [23:25:32] hashar: ok, i can land it, but i won't be here tomorrow [23:26:04] so .. maybe better to do that together then? [23:26:10] PROBLEM - puppet last run on mw1219 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [23:26:29] we can land the change, and I will restart jenkins tomorrow to triple confirm [23:26:33] but I am not worrying :] [23:26:35] ok [23:26:37] but I am not worried [23:26:44] i doint see it on the other instances [23:26:44] (03CR) 10Dzahn: [C: 032] jenkins: support umask via service default [puppet] - 10https://gerrit.wikimedia.org/r/337377 (owner: 10Hashar) [23:26:51] paladox: firewalling? [23:26:59] paladox: any changes? [23:26:59] I am afraid of having to fix up jenkins at 1am :] [23:27:17] Maybe, only change i did was change vcs ip to floating ip. [23:27:20] for ssh. [23:27:33] mutante: thanks for all the patches review this week :] [23:27:43] paladox: probably related to getting the new IP and firewalling [23:27:55] paladox: try to connect to it manually with telnet or nc [23:28:02] ok [23:28:11] hashar: yw :) [23:28:17] oh [23:28:24] telnet ldap://ldap-labs.eqiad.wikimedia.org 389 [23:28:24] I can actually test on cont2001 :) [23:28:43] yes to both :) [23:28:53] telnet ldap-labs.eqiad.wikimedia.org 389 [23:28:53] Trying 208.80.154.79... 
[23:28:53] telnet: Unable to connect to remote host: No route to host [23:28:56] paladox: well, without the protocol [23:29:00] RECOVERY - puppet last run on francium is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures [23:29:24] -# UMASK=027 [23:29:24] +UMASK=0002 [23:29:31] --umask=0002 [23:29:34] on contint2001 :) [23:29:44] applied on 1001 [23:30:16] so should be fine [23:30:23] I will restart Jenkins on contint1001 tomorrow [23:30:56] and the logrotate for /var/log/jenkins/access.log works! [23:32:08] (03CR) 10JGirault: gerrit: Make blue buttons look like OOUI (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/337397 (https://phabricator.wikimedia.org/T158298) (owner: 10Ladsgroup) [23:32:21] hashar: :) ok, nice [23:32:30] enjoy your week-end! :] [23:32:34] I am heading to bed [23:32:39] you too, it will be long over here [23:32:49] "president's day" .. great timing for that, heh [23:33:06] good night hashar, bye [23:34:09] :] [23:41:57] (03PS1) 10RobH: update nithum's ssh pub key [puppet] - 10https://gerrit.wikimedia.org/r/338291 [23:43:00] PROBLEM - puppet last run on mc1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [23:43:16] Amir1: due to other work, your html fixes will go out tomorrow evening rather than today [23:43:29] I just wound up my regular work for the night (almost 2 am) [23:43:35] sorry for the delay [23:45:36] (03CR) 10RobH: [C: 032] update nithum's ssh pub key [puppet] - 10https://gerrit.wikimedia.org/r/338291 (owner: 10RobH) [23:48:11] RECOVERY - puppet last run on ms-be3001 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures [23:49:35] 06Operations, 10Ops-Access-Requests, 06Research-and-Data, 13Patch-For-Review: Cluster Access for Nithum Thain - https://phabricator.wikimedia.org/T157724#3035128 (10RobH) The new key is now live, it can take up to 30 minutes for all affected hosts to call in for the change. [23:50:50] RECOVERY - puppet last run on db1076 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures [23:54:10] RECOVERY - puppet last run on mw1219 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures
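Putting the two /etc/default/jenkins diffs quoted in this log together (the PREFIX one earlier, the UMASK one just above), the defaults file ends up carrying both settings and the init script simply interpolates them. This is a reconstruction from the quoted diffs, not a copy of the real files, and the exact wording of the UMASK handling in the init script is an assumption:

    # /etc/default/jenkins (excerpt, reconstructed)
    PREFIX=/ci
    UMASK=0002

    # /etc/init.d/jenkins then builds the daemon arguments, roughly:
    JENKINS_ARGS="--webroot=/var/run/jenkins/war --httpPort=$HTTP_PORT --ajp13Port=$AJP_PORT --prefix=$PREFIX $JENKINS_ACCESSLOG_ENABLE"
    if [ -n "$UMASK" ]; then
        DAEMON_ARGS="$DAEMON_ARGS --umask=$UMASK"
    fi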