[00:07:53] PROBLEM - LVS HTTP IPv6 on mobile-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: Connection timed out [00:10:01] (CR) BBlack: [C: 2 V: 2] Update multi-cert patches to Apr 27 version from nginx-devel list [software/nginx] (wmf-1.9.3-1) - https://gerrit.wikimedia.org/r/230472 (owner: BBlack) [00:10:10] (CR) BBlack: [C: 2 V: 2] Add multicert changes to the new stream modules [software/nginx] (wmf-1.9.3-1) - https://gerrit.wikimedia.org/r/230473 (owner: BBlack) [00:10:20] (CR) BBlack: [C: 2 V: 2] Release 1.9.3-1+wmf2 (newer multicert patches) [software/nginx] (wmf-1.9.3-1) - https://gerrit.wikimedia.org/r/230469 (owner: BBlack) [00:11:44] RECOVERY - LVS HTTP IPv6 on mobile-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 301 TLS Redirect - 497 bytes in 0.004 second response time [00:23:13] PROBLEM - puppet last run on cp3009 is CRITICAL puppet fail [00:31:28] catchpoint alerts but things seem ok [00:31:41] bblack: are you deploying the nginx versions now? [00:34:20] all seems ok now [00:34:22] * YuviPanda goes away [00:50:55] RECOVERY - puppet last run on cp3009 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [00:55:39] umm, phab down again? [00:55:42] hey, phab is down [00:55:44] Failed to create a temporary directory: the disk is full. [00:56:03] yesterday apergos deleted some log files [00:57:38] yup, same here [00:58:04] PROBLEM - https://phabricator.wikimedia.org on iridium is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - string Wikimedia and MediaWiki not found on https://phabricator.wikimedia.org:443https://phabricator.wikimedia.org/ - 1764 bytes in 0.880 second response time [00:58:10] there we go [00:58:18] we apparently are faster than icinga, even on a Sunday ;) [00:59:19] twentyafterfour: around, per chance? need to delete some more files on the phab host [01:00:02] is it running on a thumb drive?
[01:00:20] beagel board [01:10:42] (PS1) Tim Landscheidt: Tools: Puppetize gridengine global configuration [puppet] - https://gerrit.wikimedia.org/r/230477 (https://phabricator.wikimedia.org/T95747) [01:10:55] I have a spare 40GB SATA drive from my PS3 I can donate to our phabricator instance, just tell me where to drive [01:11:40] :/ [01:12:22] Microsoft once had a fun issue with the tracking system... bug number exceeded 64535... [01:14:37] 64535? [01:15:30] pro-tip: if you were in the middle of submitting an epic bug report, open Firebug, Ctrl-R to repeat the form submission, then you can copy the POST Parameters from Net > POST /maniphest/task/create > Post tab context menu. [01:16:06] hah [01:16:34] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 7.14% of data above the critical threshold [500.0] [01:19:21] yurik, do you mean 65535? [01:19:55] Krenair, yep, typo ;) [01:21:39] (CR) Tim Landscheidt: [C: -1] ""EDITOR=tee qconf -mconf" – not pretty, but works :-). Tested on Toolsbeta."
[puppet] - https://gerrit.wikimedia.org/r/230477 (https://phabricator.wikimedia.org/T95747) (owner: Tim Landscheidt) [01:23:53] RECOVERY - https://phabricator.wikimedia.org on iridium is OK: HTTP OK: HTTP/1.1 200 OK - 21162 bytes in 0.170 second response time [01:24:25] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0] [01:24:25] thanks someone [01:25:00] Might have just been logrotate kicking in [01:25:03] Or someone [01:25:11] no !log [01:29:53] PROBLEM - check_disk on backup4001 is CRITICAL: DISK CRITICAL - free space: / 873902 MB (99% inode=99%): /dev 7991 MB (99% inode=99%): /run 1599 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 8000 MB (100% inode=99%): /archive 312986 MB (10% inode=99%) [01:34:44] PROBLEM - check_disk on backup4001 is CRITICAL: DISK CRITICAL - free space: / 873902 MB (99% inode=99%): /dev 7991 MB (99% inode=99%): /run 1599 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 8000 MB (100% inode=99%): /archive 303912 MB (10% inode=99%) [01:35:33] operations, Discovery-Maps-Sprint: Postgres replication is not working - https://phabricator.wikimedia.org/T108545#1523135 (MaxSem) NEW [01:39:37] operations, Graphite: grafana access control - https://phabricator.wikimedia.org/T108546#1523142 (Tgr) NEW [01:39:53] PROBLEM - check_disk on backup4001 is CRITICAL: DISK CRITICAL - free space: / 873902 MB (99% inode=99%): /dev 7991 MB (99% inode=99%): /run 1599 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 8000 MB (100% inode=99%): /archive 301452 MB (10% inode=99%) [01:44:53] PROBLEM - check_disk on backup4001 is CRITICAL: DISK CRITICAL - free space: / 873902 MB (99% inode=99%): /dev 7991 MB (99% inode=99%): /run 1599 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 8000 MB (100% inode=99%): /archive 300819 MB (10% inode=99%) [01:49:53] PROBLEM - check_disk on backup4001 is CRITICAL: DISK CRITICAL - free space: / 873902 MB (99% inode=99%): 
/dev 7991 MB (99% inode=99%): /run 1599 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 8000 MB (100% inode=99%): /archive 300819 MB (10% inode=99%) [01:51:16] matt_flaschen: Your convertLqtOnLocalWiki script on terbium is using ~15GB memory by now, you might want to look after it [01:54:53] PROBLEM - check_disk on backup4001 is CRITICAL: DISK CRITICAL - free space: / 873902 MB (99% inode=99%): /dev 7991 MB (99% inode=99%): /run 1599 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 8000 MB (100% inode=99%): /archive 300819 MB (10% inode=99%) [01:56:55] that would explain why it keeps getting slower. [01:57:07] hoo, damn, thanks for letting me know. [01:57:09] operations, Discovery-Maps-Sprint: Postgres replication is not working - https://phabricator.wikimedia.org/T108545#1523162 (Yurik) [01:57:30] I'll kill it now (it is resumable). [01:58:12] Great :) [01:58:26] https://ganglia.wikimedia.org/latest/graph.php?r=month&z=xlarge&h=terbium.eqiad.wmnet&m=cpu_report&s=descending&mc=2&g=mem_report&c=Miscellaneous+eqiad [01:59:53] PROBLEM - check_disk on backup4001 is CRITICAL: DISK CRITICAL - free space: / 873902 MB (99% inode=99%): /dev 7991 MB (99% inode=99%): /run 1599 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 8000 MB (100% inode=99%): /archive 300819 MB (10% inode=99%) [02:00:53] operations, Discovery-Maps-Sprint: Postgres replication is not working - https://phabricator.wikimedia.org/T108545#1523172 (Yurik) [02:02:16] matt_flaschen: Shall I quickly kill it for you? [02:04:53] PROBLEM - check_disk on backup4001 is CRITICAL: DISK CRITICAL - free space: / 873902 MB (99% inode=99%): /dev 7991 MB (99% inode=99%): /run 1599 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 8000 MB (100% inode=99%): /archive 300819 MB (10% inode=99%) [02:07:36] phabricator seems to be broken: Failed to create a temporary directory: the disk is full. [02:08:59] hoo, no, I've got it, sorry for the delay.
[02:09:53] PROBLEM - check_disk on backup4001 is CRITICAL: DISK CRITICAL - free space: / 873902 MB (99% inode=99%): /dev 7991 MB (99% inode=99%): /run 1599 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 8000 MB (100% inode=99%): /archive 300819 MB (10% inode=99%) [02:10:44] Ctrl-Ced. [02:11:25] Phabricator is down? [02:11:36] Unhandled Exception ("FilesystemException"): Failed to create a temporary directory: the disk is full. [02:11:48] yeah [02:12:00] that's known, but apparently no one with shell on that machine around [02:12:49] I would file a Phabricator bug requesting icinga monitoring of disk space, but... [02:14:53] PROBLEM - check_disk on backup4001 is CRITICAL: DISK CRITICAL - free space: / 873902 MB (99% inode=99%): /dev 7991 MB (99% inode=99%): /run 1599 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 8000 MB (100% inode=99%): /archive 300819 MB (10% inode=99%) [02:16:05] matt_flaschen: Just create a patch... but I doubt it's worth it [02:16:41] most likely it's just a volume sitting on a way bigger (s|h)dd waiting to be resized [02:18:55] I would create a patch if I knew how (or someone pointed me in the right direction). [02:19:11] Definitely worth it in my opinion. Very disruptive but easily preventable things like this are a perfect use case for monitoring. 
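The disk-space monitoring asked for above could look roughly like the following Icinga/Nagios-style check. This is only a minimal sketch and not the check_disk plugin that produces the alerts in this log: the function names `classify_disk` and `check_path` and the 10%/5% thresholds are assumptions made for illustration. It relies on the conventional plugin exit codes (0 OK, 1 WARNING, 2 CRITICAL) and on Python's standard `shutil.disk_usage`.

```python
import shutil

# Conventional Nagios/Icinga plugin exit codes.
OK, WARNING, CRITICAL = 0, 1, 2


def classify_disk(free_bytes, total_bytes, warn_pct=10.0, crit_pct=5.0):
    """Return (exit_code, message) for the given free/total byte counts.

    warn_pct and crit_pct are hypothetical thresholds: the percentage of
    free space at or below which the check warns or goes critical.
    """
    free_pct = 100.0 * free_bytes / total_bytes
    if free_pct <= crit_pct:
        return CRITICAL, "DISK CRITICAL - free space: %.1f%%" % free_pct
    if free_pct <= warn_pct:
        return WARNING, "DISK WARNING - free space: %.1f%%" % free_pct
    return OK, "DISK OK - free space: %.1f%%" % free_pct


def check_path(path="/"):
    """Check the filesystem that contains `path`."""
    usage = shutil.disk_usage(path)
    return classify_disk(usage.free, usage.total)
```

A wrapper script would print the message from `check_path("/")` and exit with the returned code, so the monitoring system can alert before the disk fills up rather than after the service starts returning errors.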
[02:19:53] PROBLEM - check_disk on backup4001 is CRITICAL: DISK CRITICAL - free space: / 873902 MB (99% inode=99%): /dev 7991 MB (99% inode=99%): /run 1599 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 8000 MB (100% inode=99%): /archive 300819 MB (10% inode=99%) [02:20:25] !log l10nupdate Synchronized php-1.26wmf17/cache/l10n: l10nupdate for 1.26wmf17 (duration: 06m 43s) [02:20:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:23:47] !log l10nupdate@tin LocalisationUpdate completed (1.26wmf17) at 2015-08-10 02:23:47+00:00 [02:23:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:24:53] PROBLEM - check_disk on backup4001 is CRITICAL: DISK CRITICAL - free space: / 873902 MB (99% inode=99%): /dev 7991 MB (99% inode=99%): /run 1599 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 8000 MB (100% inode=99%): /archive 300819 MB (10% inode=99%) [02:29:54] PROBLEM - check_disk on backup4001 is CRITICAL: DISK CRITICAL - free space: / 873902 MB (99% inode=99%): /dev 7991 MB (99% inode=99%): /run 1599 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 8000 MB (100% inode=99%): /archive 300819 MB (10% inode=99%) [02:34:53] PROBLEM - check_disk on backup4001 is CRITICAL: DISK CRITICAL - free space: / 873902 MB (99% inode=99%): /dev 7991 MB (99% inode=99%): /run 1599 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 8000 MB (100% inode=99%): /archive 300819 MB (10% inode=99%) [02:39:44] PROBLEM - check_disk on backup4001 is CRITICAL: DISK CRITICAL - free space: / 873902 MB (99% inode=99%): /dev 7991 MB (99% inode=99%): /run 1599 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 8000 MB (100% inode=99%): /archive 300819 MB (10% inode=99%) [02:40:23] PROBLEM - LVS HTTP IPv6 on mobile-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: Connection timed out [02:42:00] seems phabricator is up again [02:42:13] RECOVERY - LVS HTTP IPv6 on 
mobile-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 301 TLS Redirect - 497 bytes in 0.009 second response time [02:44:53] PROBLEM - check_disk on backup4001 is CRITICAL: DISK CRITICAL - free space: / 873902 MB (99% inode=99%): /dev 7991 MB (99% inode=99%): /run 1599 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 8000 MB (100% inode=99%): /archive 300819 MB (10% inode=99%) [02:49:54] PROBLEM - check_disk on backup4001 is CRITICAL: DISK CRITICAL - free space: / 873902 MB (99% inode=99%): /dev 7991 MB (99% inode=99%): /run 1599 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 8000 MB (100% inode=99%): /archive 300819 MB (10% inode=99%) [02:51:22] some functions are still broken [02:51:26] e.g. https://phabricator.wikimedia.org/transactions/detail/PHID-XACT-TASK-a6x3rjqwi76d57n/ [02:54:53] PROBLEM - check_disk on backup4001 is CRITICAL: DISK CRITICAL - free space: / 873902 MB (99% inode=99%): /dev 7991 MB (99% inode=99%): /run 1599 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 8000 MB (100% inode=99%): /archive 300819 MB (10% inode=99%) [02:59:53] PROBLEM - check_disk on backup4001 is CRITICAL: DISK CRITICAL - free space: / 873902 MB (99% inode=99%): /dev 7991 MB (99% inode=99%): /run 1599 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 8000 MB (100% inode=99%): /archive 300819 MB (10% inode=99%) [03:01:52] operations, Commons, MediaWiki-File-management, MediaWiki-Tarball-Backports, and 7 others: InstantCommons broken by switch to HTTPS - https://phabricator.wikimedia.org/T102566#1523217 (Tgr) If you followed the instructions in https://www.howtoforge.com/how-to-install-mediawiki-on-ubuntu-14.04 exact...
[03:04:53] PROBLEM - check_disk on backup4001 is CRITICAL: DISK CRITICAL - free space: / 873902 MB (99% inode=99%): /dev 7991 MB (99% inode=99%): /run 1599 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 8000 MB (100% inode=99%): /archive 300819 MB (10% inode=99%) [03:09:53] PROBLEM - check_disk on backup4001 is CRITICAL: DISK CRITICAL - free space: / 873902 MB (99% inode=99%): /dev 7991 MB (99% inode=99%): /run 1599 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 8000 MB (100% inode=99%): /archive 300819 MB (10% inode=99%) [03:13:19] !log rm /var/log/apache2/phabricator_access.log.1 on iridium (disk full, fixed for now) [03:13:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [03:13:47] !log restarted apache2 on iridium JIC [03:13:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [03:13:54] RECOVERY - Disk space on iridium is OK: DISK OK [03:14:53] PROBLEM - check_disk on backup4001 is CRITICAL: DISK CRITICAL - free space: / 873902 MB (99% inode=99%): /dev 7991 MB (99% inode=99%): /run 1599 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 8000 MB (100% inode=99%): /archive 300819 MB (10% inode=99%) [03:19:53] PROBLEM - check_disk on backup4001 is CRITICAL: DISK CRITICAL - free space: / 873902 MB (99% inode=99%): /dev 7991 MB (99% inode=99%): /run 1599 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 8000 MB (100% inode=99%): /archive 300819 MB (10% inode=99%) [03:24:53] PROBLEM - check_disk on backup4001 is CRITICAL: DISK CRITICAL - free space: / 873902 MB (99% inode=99%): /dev 7991 MB (99% inode=99%): /run 1599 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 8000 MB (100% inode=99%): /archive 300819 MB (10% inode=99%) [03:25:28] (PS1) Smalyshev: T107819: allowd wdqs admins to sudo into blazegraph user [puppet] - https://gerrit.wikimedia.org/r/230482 [03:26:12] (CR) jenkins-bot: [V: -1] T107819: allowd wdqs 
admins to sudo into blazegraph user [puppet] - https://gerrit.wikimedia.org/r/230482 (owner: Smalyshev) [03:29:02] (PS2) Smalyshev: T107819: allowd wdqs admins to sudo into blazegraph user [puppet] - https://gerrit.wikimedia.org/r/230482 [03:29:53] PROBLEM - check_disk on backup4001 is CRITICAL: DISK CRITICAL - free space: / 873902 MB (99% inode=99%): /dev 7991 MB (99% inode=99%): /run 1599 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 8000 MB (100% inode=99%): /archive 300819 MB (10% inode=99%) [03:34:44] PROBLEM - check_disk on backup4001 is CRITICAL: DISK CRITICAL - free space: / 873902 MB (99% inode=99%): /dev 7991 MB (99% inode=99%): /run 1599 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 8000 MB (100% inode=99%): /archive 300819 MB (10% inode=99%) [03:39:53] PROBLEM - check_disk on backup4001 is CRITICAL: DISK CRITICAL - free space: / 873902 MB (99% inode=99%): /dev 7991 MB (99% inode=99%): /run 1599 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 8000 MB (100% inode=99%): /archive 300819 MB (10% inode=99%) [03:44:53] PROBLEM - check_disk on backup4001 is CRITICAL: DISK CRITICAL - free space: / 873902 MB (99% inode=99%): /dev 7991 MB (99% inode=99%): /run 1599 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 8000 MB (100% inode=99%): /archive 300819 MB (10% inode=99%) [03:49:53] PROBLEM - check_disk on backup4001 is CRITICAL: DISK CRITICAL - free space: / 873902 MB (99% inode=99%): /dev 7991 MB (99% inode=99%): /run 1599 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 8000 MB (100% inode=99%): /archive 300819 MB (10% inode=99%) [03:54:53] PROBLEM - check_disk on backup4001 is CRITICAL: DISK CRITICAL - free space: / 873902 MB (99% inode=99%): /dev 7991 MB (99% inode=99%): /run 1599 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 8000 MB (100% inode=99%): /archive 300819 MB (10% inode=99%) [03:59:53] PROBLEM - check_disk on backup4001 is CRITICAL: DISK CRITICAL - 
free space: / 873902 MB (99% inode=99%): /dev 7991 MB (99% inode=99%): /run 1599 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 8000 MB (100% inode=99%): /archive 300819 MB (10% inode=99%) [04:02:35] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 7.14% of data above the critical threshold [500.0] [04:04:53] PROBLEM - check_disk on backup4001 is CRITICAL: DISK CRITICAL - free space: / 873902 MB (99% inode=99%): /dev 7991 MB (99% inode=99%): /run 1599 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 8000 MB (100% inode=99%): /archive 300819 MB (10% inode=99%) [04:09:53] PROBLEM - check_disk on backup4001 is CRITICAL: DISK CRITICAL - free space: / 873902 MB (99% inode=99%): /dev 7991 MB (99% inode=99%): /run 1599 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 8000 MB (100% inode=99%): /archive 300819 MB (10% inode=99%) [04:10:16] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0] [04:14:42] (PS1) Smalyshev: Create real URIs for wikidata RDF URIs [puppet] - https://gerrit.wikimedia.org/r/230483 [04:14:53] PROBLEM - check_disk on backup4001 is CRITICAL: DISK CRITICAL - free space: / 873902 MB (99% inode=99%): /dev 7991 MB (99% inode=99%): /run 1599 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 8000 MB (100% inode=99%): /archive 300819 MB (10% inode=99%) [04:19:53] PROBLEM - check_disk on backup4001 is CRITICAL: DISK CRITICAL - free space: / 873902 MB (99% inode=99%): /dev 7991 MB (99% inode=99%): /run 1599 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 8000 MB (100% inode=99%): /archive 300819 MB (10% inode=99%) [04:24:53] PROBLEM - check_disk on backup4001 is CRITICAL: DISK CRITICAL - free space: / 873901 MB (99% inode=99%): /dev 7991 MB (99% inode=99%): /run 1599 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 8000 MB (100% inode=99%): /archive 300819 MB (10% inode=99%) [04:28:34] RECOVERY - are wikitech and wt-static in sync 
on silver is OK: wikitech-static OK - wikitech and wikitech-static in sync (7105 100000s) [04:29:53] PROBLEM - check_disk on backup4001 is CRITICAL: DISK CRITICAL - free space: / 873901 MB (99% inode=99%): /dev 7991 MB (99% inode=99%): /run 1599 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 8000 MB (100% inode=99%): /archive 300819 MB (10% inode=99%) [04:34:53] PROBLEM - check_disk on backup4001 is CRITICAL: DISK CRITICAL - free space: / 873901 MB (99% inode=99%): /dev 7991 MB (99% inode=99%): /run 1599 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 8000 MB (100% inode=99%): /archive 300819 MB (10% inode=99%) [04:39:53] PROBLEM - check_disk on backup4001 is CRITICAL: DISK CRITICAL - free space: / 873901 MB (99% inode=99%): /dev 7991 MB (99% inode=99%): /run 1599 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 8000 MB (100% inode=99%): /archive 300819 MB (10% inode=99%) [04:44:53] PROBLEM - check_disk on backup4001 is CRITICAL: DISK CRITICAL - free space: / 873901 MB (99% inode=99%): /dev 7991 MB (99% inode=99%): /run 1599 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 8000 MB (100% inode=99%): /archive 300819 MB (10% inode=99%) [04:49:53] PROBLEM - check_disk on backup4001 is CRITICAL: DISK CRITICAL - free space: / 873901 MB (99% inode=99%): /dev 7991 MB (99% inode=99%): /run 1599 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 8000 MB (100% inode=99%): /archive 300819 MB (10% inode=99%) [04:54:53] PROBLEM - check_disk on backup4001 is CRITICAL: DISK CRITICAL - free space: / 873901 MB (99% inode=99%): /dev 7991 MB (99% inode=99%): /run 1599 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 8000 MB (100% inode=99%): /archive 300819 MB (10% inode=99%) [04:59:53] PROBLEM - check_disk on backup4001 is CRITICAL: DISK CRITICAL - free space: / 873901 MB (99% inode=99%): /dev 7991 MB (99% inode=99%): /run 1599 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 8000 MB (100% 
inode=99%): /archive 300819 MB (10% inode=99%) [05:04:44] PROBLEM - check_disk on backup4001 is CRITICAL: DISK CRITICAL - free space: / 873901 MB (99% inode=99%): /dev 7991 MB (99% inode=99%): /run 1599 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 8000 MB (100% inode=99%): /archive 300819 MB (10% inode=99%) [05:09:53] PROBLEM - check_disk on backup4001 is CRITICAL: DISK CRITICAL - free space: / 873901 MB (99% inode=99%): /dev 7991 MB (99% inode=99%): /run 1599 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 8000 MB (100% inode=99%): /archive 300819 MB (10% inode=99%) [05:14:53] PROBLEM - check_disk on backup4001 is CRITICAL: DISK CRITICAL - free space: / 873901 MB (99% inode=99%): /dev 7991 MB (99% inode=99%): /run 1599 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 8000 MB (100% inode=99%): /archive 300819 MB (10% inode=99%) [05:19:53] PROBLEM - check_disk on backup4001 is CRITICAL: DISK CRITICAL - free space: / 873901 MB (99% inode=99%): /dev 7991 MB (99% inode=99%): /run 1599 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 8000 MB (100% inode=99%): /archive 300819 MB (10% inode=99%) [05:24:53] PROBLEM - check_disk on backup4001 is CRITICAL: DISK CRITICAL - free space: / 873901 MB (99% inode=99%): /dev 7991 MB (99% inode=99%): /run 1599 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 8000 MB (100% inode=99%): /archive 300819 MB (10% inode=99%) [05:29:53] PROBLEM - check_disk on backup4001 is CRITICAL: DISK CRITICAL - free space: / 873901 MB (99% inode=99%): /dev 7991 MB (99% inode=99%): /run 1599 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 8000 MB (100% inode=99%): /archive 300819 MB (10% inode=99%) [05:31:29] !log l10nupdate@tin ResourceLoader cache refresh completed at Mon Aug 10 05:31:29 UTC 2015 (duration 31m 28s) [05:31:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [05:34:53] PROBLEM - check_disk on backup4001 is CRITICAL: DISK 
CRITICAL - free space: / 873901 MB (99% inode=99%): /dev 7991 MB (99% inode=99%): /run 1599 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 8000 MB (100% inode=99%): /archive 300819 MB (10% inode=99%) [05:39:53] PROBLEM - check_disk on backup4001 is CRITICAL: DISK CRITICAL - free space: / 873901 MB (99% inode=99%): /dev 7991 MB (99% inode=99%): /run 1599 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 8000 MB (100% inode=99%): /archive 300819 MB (10% inode=99%) [05:42:23] (CR) Edenhill: [C: 1] [WIP] Add format.topic configuration (1 comment) [software/varnish/varnishkafka] - https://gerrit.wikimedia.org/r/230173 (https://phabricator.wikimedia.org/T108379) (owner: Ottomata) [05:44:53] PROBLEM - check_disk on backup4001 is CRITICAL: DISK CRITICAL - free space: / 873901 MB (99% inode=99%): /dev 7991 MB (99% inode=99%): /run 1599 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 8000 MB (100% inode=99%): /archive 300819 MB (10% inode=99%) [05:49:53] PROBLEM - check_disk on backup4001 is CRITICAL: DISK CRITICAL - free space: / 873901 MB (99% inode=99%): /dev 7991 MB (99% inode=99%): /run 1599 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 8000 MB (100% inode=99%): /archive 300819 MB (10% inode=99%) [05:54:53] PROBLEM - check_disk on backup4001 is CRITICAL: DISK CRITICAL - free space: / 873901 MB (99% inode=99%): /dev 7991 MB (99% inode=99%): /run 1599 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 8000 MB (100% inode=99%): /archive 300819 MB (10% inode=99%) [05:59:53] PROBLEM - check_disk on backup4001 is CRITICAL: DISK CRITICAL - free space: / 873901 MB (99% inode=99%): /dev 7991 MB (99% inode=99%): /run 1599 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 8000 MB (100% inode=99%): /archive 300819 MB (10% inode=99%) [06:04:53] PROBLEM - check_disk on backup4001 is CRITICAL: DISK CRITICAL - free space: / 873901 MB (99% inode=99%): /dev 7991 MB (99% inode=99%): /run 1599 MB 
(99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 8000 MB (100% inode=99%): /archive 300819 MB (10% inode=99%) [06:09:53] PROBLEM - check_disk on backup4001 is CRITICAL: DISK CRITICAL - free space: / 873901 MB (99% inode=99%): /dev 7991 MB (99% inode=99%): /run 1599 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 8000 MB (100% inode=99%): /archive 300819 MB (10% inode=99%) [06:14:53] PROBLEM - check_disk on backup4001 is CRITICAL: DISK CRITICAL - free space: / 873901 MB (99% inode=99%): /dev 7991 MB (99% inode=99%): /run 1599 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 8000 MB (100% inode=99%): /archive 300819 MB (10% inode=99%) [06:19:53] PROBLEM - check_disk on backup4001 is CRITICAL: DISK CRITICAL - free space: / 873901 MB (99% inode=99%): /dev 7991 MB (99% inode=99%): /run 1599 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 8000 MB (100% inode=99%): /archive 300819 MB (10% inode=99%) [06:21:02] apergos: ^ ? [06:23:14] <_joe_> matanya: it's a FR host, I have no access [06:23:33] <_joe_> (btw now you know why I say we should avoid icinga alerts with variable messages) [06:24:53] PROBLEM - check_disk on backup4001 is CRITICAL: DISK CRITICAL - free space: / 873901 MB (99% inode=99%): /dev 7991 MB (99% inode=99%): /run 1599 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 8000 MB (100% inode=99%): /archive 300819 MB (10% inode=99%) [06:26:04] heh, thanks _joe_ [06:29:53] PROBLEM - check_disk on backup4001 is CRITICAL: DISK CRITICAL - free space: / 873906 MB (99% inode=99%): /dev 7991 MB (99% inode=99%): /run 1599 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 8000 MB (100% inode=99%): /archive 300819 MB (10% inode=99%) [06:31:13] PROBLEM - puppet last run on mw2069 is CRITICAL Puppet has 1 failures [06:32:04] PROBLEM - puppet last run on db1046 is CRITICAL Puppet has 1 failures [06:34:53] PROBLEM - check_disk on backup4001 is CRITICAL: DISK CRITICAL - free space: / 873906 MB (99% 
inode=99%): /dev 7991 MB (99% inode=99%): /run 1599 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 8000 MB (100% inode=99%): /archive 300819 MB (10% inode=99%) [06:39:53] PROBLEM - check_disk on backup4001 is CRITICAL: DISK CRITICAL - free space: / 873906 MB (99% inode=99%): /dev 7991 MB (99% inode=99%): /run 1599 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 8000 MB (100% inode=99%): /archive 300819 MB (10% inode=99%) [06:44:53] PROBLEM - check_disk on backup4001 is CRITICAL: DISK CRITICAL - free space: / 873906 MB (99% inode=99%): /dev 7991 MB (99% inode=99%): /run 1599 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 8000 MB (100% inode=99%): /archive 300819 MB (10% inode=99%) [06:48:30] (PS1) Jcrespo: Increase db1035 weight after repool [mediawiki-config] - https://gerrit.wikimedia.org/r/230492 [06:49:53] PROBLEM - check_disk on backup4001 is CRITICAL: DISK CRITICAL - free space: / 873906 MB (99% inode=99%): /dev 7991 MB (99% inode=99%): /run 1599 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 8000 MB (100% inode=99%): /archive 300819 MB (10% inode=99%) [06:50:09] (CR) Jcrespo: [C: 2] Increase db1035 weight after repool [mediawiki-config] - https://gerrit.wikimedia.org/r/230492 (owner: Jcrespo) [06:52:02] morning [06:52:06] gee lots happening already [06:52:33] <_joe_> apergos: not really :) [06:52:40] <_joe_> a lot of icinga spam [06:52:44] !log jynus Synchronized wmf-config/db-eqiad.php: Increase db1035 weight (duration: 00m 13s) [06:52:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [06:53:14] ah. 
"lots spamming already" then :-D [06:54:24] PROBLEM - puppet last run on mw2107 is CRITICAL Puppet has 1 failures [06:54:53] PROBLEM - check_disk on backup4001 is CRITICAL: DISK CRITICAL - free space: / 873906 MB (99% inode=99%): /dev 7991 MB (99% inode=99%): /run 1599 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 8000 MB (100% inode=99%): /archive 300819 MB (10% inode=99%) [06:55:34] RECOVERY - puppet last run on db1046 is OK Puppet is currently enabled, last run 44 seconds ago with 0 failures [06:56:43] RECOVERY - puppet last run on mw2069 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:59:53] PROBLEM - check_disk on backup4001 is CRITICAL: DISK CRITICAL - free space: / 873905 MB (99% inode=99%): /dev 7991 MB (99% inode=99%): /run 1599 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 8000 MB (100% inode=99%): /archive 300819 MB (10% inode=99%) [07:01:10] (CR) Giuseppe Lavagetto: [C: 2] T107819: allowd wdqs admins to sudo into blazegraph user [puppet] - https://gerrit.wikimedia.org/r/230482 (owner: Smalyshev) [07:04:53] PROBLEM - check_disk on backup4001 is CRITICAL: DISK CRITICAL - free space: / 873905 MB (99% inode=99%): /dev 7991 MB (99% inode=99%): /run 1599 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 8000 MB (100% inode=99%): /archive 300819 MB (10% inode=99%) [07:08:21] operations, vm-requests, Patch-For-Review, Pybal: codfw: 3 VM request for PyBal - https://phabricator.wikimedia.org/T107901#1523413 (akosiaris) Open>Resolved I 've created the VMs on Friday but never did the first puppet run on them, aiming to do it today. Seems like @Dzahn did it though on Fr...
[07:09:53] PROBLEM - check_disk on backup4001 is CRITICAL: DISK CRITICAL - free space: / 873905 MB (99% inode=99%): /dev 7991 MB (99% inode=99%): /run 1599 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 8000 MB (100% inode=99%): /archive 300819 MB (10% inode=99%) [07:14:53] PROBLEM - check_disk on backup4001 is CRITICAL: DISK CRITICAL - free space: / 873905 MB (99% inode=99%): /dev 7991 MB (99% inode=99%): /run 1599 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 8000 MB (100% inode=99%): /archive 300819 MB (10% inode=99%) [07:19:35] (CR) Alexandros Kosiaris: [C: -1] "I am thinking we should use base::service_unit instead to obtain forwards compatibility with Debian Jessie as well and maintain compatibil" (3 comments) [puppet] - https://gerrit.wikimedia.org/r/230108 (owner: KartikMistry) [07:19:53] PROBLEM - check_disk on backup4001 is CRITICAL: DISK CRITICAL - free space: / 873905 MB (99% inode=99%): /dev 7991 MB (99% inode=99%): /run 1599 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 8000 MB (100% inode=99%): /archive 300819 MB (10% inode=99%) [07:20:22] !log schema change on testwikidatawiki [07:20:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [07:22:13] RECOVERY - puppet last run on mw2107 is OK Puppet is currently enabled, last run 55 seconds ago with 0 failures [07:22:44] PROBLEM - puppet last run on mw1047 is CRITICAL Puppet has 1 failures [07:26:40] (CR) Merlijn van Deen: [C: 1] "lgtm."
[puppet] - https://gerrit.wikimedia.org/r/230376 (https://phabricator.wikimedia.org/T107821) (owner: Tim Landscheidt) [07:27:21] !log rebooting backup4001 [07:27:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [07:34:06] operations, ops-codfw: ms-be2003.codfw.wmnet: slot=1 dev=sdb failed - https://phabricator.wikimedia.org/T108561#1523450 (fgiunchedi) NEW [07:36:10] (PS1) Amire80: Add https://blogdukiwi.wordpress.com/ to the French Planet [puppet] - https://gerrit.wikimedia.org/r/230497 [07:39:03] operations, ops-codfw: ms-be2006.codfw.wmnet: slot=11 dev=sdl failed - https://phabricator.wikimedia.org/T108562#1523455 (fgiunchedi) NEW [07:42:57] Ops-Access-Requests, operations, Discovery-Wikidata-Query-Service-Sprint, Patch-For-Review: Need sudo to blazegraph on wdqs1001/1002 - https://phabricator.wikimedia.org/T107819#1523466 (Smalyshev) Open>Resolved [07:44:35] !log reboot ms-be2006, xfs hosed [07:44:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [07:47:03] RECOVERY - puppet last run on mw1047 is OK Puppet is currently enabled, last run 7 seconds ago with 0 failures [07:50:13] PROBLEM - check_disk on backup4001 is CRITICAL: DISK CRITICAL - free space: / 873905 MB (99% inode=99%): /dev 7991 MB (99% inode=99%): /run 1599 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 8000 MB (100% inode=99%): /archive 300819 MB (10% inode=99%) [07:54:04] (PS3) Giuseppe Lavagetto: apache: allow tuning of logrotate parameters [puppet] - https://gerrit.wikimedia.org/r/230384 [07:55:13] PROBLEM - check_disk on backup4001 is CRITICAL: DISK CRITICAL - free space: / 873905 MB (99% inode=99%): /dev 7991 MB (99% inode=99%): /run 1599 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 8000 MB (100% inode=99%): /archive 300819 MB (10% inode=99%) [08:00:13] PROBLEM - check_disk on backup4001 is CRITICAL: DISK CRITICAL - free space: / 873905 MB (99% 
inode=99%): /dev 7991 MB (99% inode=99%): /run 1599 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 8000 MB (100% inode=99%): /archive 300819 MB (10% inode=99%) [08:02:25] (03PS1) 10Filippo Giunchedi: rsyslog: use logrotate delaycompress and reload rsyslog [puppet] - 10https://gerrit.wikimedia.org/r/230500 (https://phabricator.wikimedia.org/T107611) [08:05:14] PROBLEM - check_disk on backup4001 is CRITICAL: DISK CRITICAL - free space: / 873905 MB (99% inode=99%): /dev 7991 MB (99% inode=99%): /run 1599 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 8000 MB (100% inode=99%): /archive 300819 MB (10% inode=99%) [08:06:27] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] rsyslog: use logrotate delaycompress and reload rsyslog [puppet] - 10https://gerrit.wikimedia.org/r/230500 (https://phabricator.wikimedia.org/T107611) (owner: 10Filippo Giunchedi) [08:10:13] PROBLEM - check_disk on backup4001 is CRITICAL: DISK CRITICAL - free space: / 873905 MB (99% inode=99%): /dev 7991 MB (99% inode=99%): /run 1599 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 8000 MB (100% inode=99%): /archive 300819 MB (10% inode=99%) [08:10:54] RECOVERY - RAID on ms-be2006 is OK optimal, 13 logical, 13 physical [08:11:03] RECOVERY - very high load average likely xfs on ms-be2006 is OK - load average: 20.68, 7.79, 2.84 [08:12:22] 6operations, 10ops-codfw: ms-be2006.codfw.wmnet: slot=11 dev=sdl failed - https://phabricator.wikimedia.org/T108562#1523503 (10fgiunchedi) [08:12:23] 6operations, 10ops-codfw: ms-be2006 failed disk - https://phabricator.wikimedia.org/T108340#1523506 (10fgiunchedi) [08:15:13] PROBLEM - check_disk on backup4001 is CRITICAL: DISK CRITICAL - free space: / 873905 MB (99% inode=99%): /dev 7991 MB (99% inode=99%): /run 1599 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 8000 MB (100% inode=99%): /archive 300819 MB (10% inode=99%) [08:18:57] (03CR) 10Alexandros Kosiaris: [C: 031] "I was noting on IRC that this ensures absent 
logrotate-passenger and apache2 is already ensured absent effectively meaning no rotation but" [puppet] - 10https://gerrit.wikimedia.org/r/230384 (owner: 10Giuseppe Lavagetto) [08:20:13] PROBLEM - check_disk on backup4001 is CRITICAL: DISK CRITICAL - free space: / 873905 MB (99% inode=99%): /dev 7991 MB (99% inode=99%): /run 1599 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 8000 MB (100% inode=99%): /archive 300819 MB (10% inode=99%) [08:21:15] (03Abandoned) 10KartikMistry: Limit number of APY instances to 8 [debs/contenttranslation/apertium-apy] - 10https://gerrit.wikimedia.org/r/229359 (owner: 10KartikMistry) [08:25:13] PROBLEM - check_disk on backup4001 is CRITICAL: DISK CRITICAL - free space: / 873905 MB (99% inode=99%): /dev 7991 MB (99% inode=99%): /run 1599 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 8000 MB (100% inode=99%): /archive 300819 MB (10% inode=99%) [08:30:14] PROBLEM - check_disk on backup4001 is CRITICAL: DISK CRITICAL - free space: / 873905 MB (99% inode=99%): /dev 7991 MB (99% inode=99%): /run 1599 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 8000 MB (100% inode=99%): /archive 300819 MB (10% inode=99%) [08:35:13] PROBLEM - check_disk on backup4001 is CRITICAL: DISK CRITICAL - free space: / 873905 MB (99% inode=99%): /dev 7991 MB (99% inode=99%): /run 1599 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 8000 MB (100% inode=99%): /archive 300819 MB (10% inode=99%) [08:37:47] (03PS4) 10Giuseppe Lavagetto: apache: allow tuning of logrotate parameters [puppet] - 10https://gerrit.wikimedia.org/r/230384 [08:39:11] (03CR) 10Giuseppe Lavagetto: [C: 032] apache: allow tuning of logrotate parameters [puppet] - 10https://gerrit.wikimedia.org/r/230384 (owner: 10Giuseppe Lavagetto) [08:39:17] 7Puppet, 6operations: merge swift_new and swift puppet modules/classes - https://phabricator.wikimedia.org/T107416#1523528 (10fgiunchedi) also we should be adding `nobootwait` to swift disks to avoid 
getting stuck on missing/failed drives [08:40:13] PROBLEM - check_disk on backup4001 is CRITICAL: DISK CRITICAL - free space: / 873905 MB (99% inode=99%): /dev 7991 MB (99% inode=99%): /run 1599 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 8000 MB (100% inode=99%): /archive 300819 MB (10% inode=99%) [08:43:00] 6operations, 5Patch-For-Review: rotate phab access logs more often on iridium - https://phabricator.wikimedia.org/T108503#1523531 (10Joe) a:3Joe [08:43:16] <_joe_> !log manually running logrotate on iridium [08:43:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [08:45:13] PROBLEM - check_disk on backup4001 is CRITICAL: DISK CRITICAL - free space: / 873905 MB (99% inode=99%): /dev 7991 MB (99% inode=99%): /run 1599 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 8000 MB (100% inode=99%): /archive 300819 MB (10% inode=99%) [08:46:12] (03PS1) 10Jcrespo: Depool db1042 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/230502 [08:50:13] PROBLEM - check_disk on backup4001 is CRITICAL: DISK CRITICAL - free space: / 873905 MB (99% inode=99%): /dev 7991 MB (99% inode=99%): /run 1599 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 8000 MB (100% inode=99%): /archive 300819 MB (10% inode=99%) [08:50:44] PROBLEM - puppet last run on labcontrol1001 is CRITICAL Puppet has 1 failures [08:53:10] 6operations, 10ops-codfw: ms-be2006 failed disk - https://phabricator.wikimedia.org/T108340#1523540 (10fgiunchedi) indeed, I've wiped the fs and recreated on `sdg`. also put back `sdl` in service as it was in unconfigured good, both rebuilding now ``` /dev/sdl1 1.9T 12G 1.9T 1% /srv/swift-storage/s... 
[08:55:13] PROBLEM - check_disk on backup4001 is CRITICAL: DISK CRITICAL - free space: / 873905 MB (99% inode=99%): /dev 7991 MB (99% inode=99%): /run 1599 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 8000 MB (100% inode=99%): /archive 300819 MB (10% inode=99%) [08:56:10] (03PS1) 10Giuseppe Lavagetto: dsh: re-add mw1061 [puppet] - 10https://gerrit.wikimedia.org/r/230504 (https://phabricator.wikimedia.org/T107849) [08:56:34] (03CR) 10Giuseppe Lavagetto: [C: 032] dsh: re-add mw1061 [puppet] - 10https://gerrit.wikimedia.org/r/230504 (https://phabricator.wikimedia.org/T107849) (owner: 10Giuseppe Lavagetto) [08:56:50] (03CR) 10Giuseppe Lavagetto: [V: 032] dsh: re-add mw1061 [puppet] - 10https://gerrit.wikimedia.org/r/230504 (https://phabricator.wikimedia.org/T107849) (owner: 10Giuseppe Lavagetto) [08:57:08] PROBLEM - puppet last run on labcontrol1002 is CRITICAL Puppet has 1 failures [08:58:37] <_joe_> arg, that ^^ is me, fixing [09:01:54] PROBLEM - Host backup4001 is DOWN: PING CRITICAL - Packet loss = 100% [09:02:34] oh for crying out loud [09:02:48] that was me trying to play it smart and stop those alerts [09:02:59] didn't know that would page... [09:03:37] hehe [09:03:43] RECOVERY - puppet last run on labcontrol1002 is OK Puppet is currently enabled, last run 2 minutes ago with 0 failures [09:03:50] schedule downtime ? [09:04:39] for what [09:05:09] backup4001 is fundraising, this is an nsca check [09:05:24] RECOVERY - Host backup4001 is UPING OK - Packet loss = 0%, RTA = 74.34 ms [09:06:12] avoiding the pages [09:06:29] I 've rebooted the box once already btw [09:06:32] why the hell are we paging for a backup host in the first place... [09:07:01] because we are paging for all fundraising hosts ? [09:08:07] (03PS1) 10Faidon Liambotis: torrus: remove support for bits.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/230505 [09:08:21] akosiaris: ^ ? 
:) [09:09:21] (03CR) 10Jcrespo: [C: 032] Depool db1042 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/230502 (owner: 10Jcrespo) [09:09:44] RECOVERY - puppet last run on labcontrol1001 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [09:12:42] !log jynus Synchronized wmf-config/db-eqiad.php: depool db1042 for maintenance (duration: 00m 12s) [09:12:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:15:13] PROBLEM - check_disk on backup4001 is CRITICAL: DISK CRITICAL - free space: / 873905 MB (99% inode=99%): /dev 7991 MB (99% inode=99%): /run 1599 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 8000 MB (100% inode=99%): /archive 300819 MB (10% inode=99%) [09:35:21] !log manually firewalled backup4001 TCP on neon to temporarily stop the nsca alert storm [09:35:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:42:14] PROBLEM - puppet last run on ms-be2006 is CRITICAL Puppet has 1 failures [09:46:14] RECOVERY - mediawiki-installation DSH group on mw1061 is OK [09:46:14] RECOVERY - puppet last run on mw1061 is OK Puppet is currently enabled, last run 20 minutes ago with 0 failures [09:50:25] RECOVERY - puppet last run on ms-be2006 is OK Puppet is currently enabled, last run 33 seconds ago with 0 failures [09:53:04] PROBLEM - swift-object-server on ms-be2006 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server [09:54:52] that was me ^ should be recovering [09:55:05] RECOVERY - swift-object-server on ms-be2006 is OK: PROCS OK: 101 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server [09:56:25] (03PS1) 10ArielGlenn: dumps: fix link cleanup for stubs/content parallel runs [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/230507 [09:59:28] 6operations, 10ops-eqiad, 5Patch-For-Review: mw1061 has a faulty disk, filesystem is read-only - 
https://phabricator.wikimedia.org/T107849#1523612 (10Joe) 5Open>3Resolved [10:01:57] 6operations, 10ops-ulsfo: RIPE Atlas Anchor @ ulsfo is down - https://phabricator.wikimedia.org/T107691#1523614 (10faidon) a:3RobH [10:01:57] (03CR) 10ArielGlenn: [C: 032] dumps: fix link cleanup for stubs/content parallel runs [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/230507 (owner: 10ArielGlenn) [10:03:23] 6operations, 7Graphite: scale statsd reporting/aggregation (plan) - https://phabricator.wikimedia.org/T89857#1523618 (10fgiunchedi) >>! In T89857#1519939, @Tgr wrote: > https://gerrit.wikimedia.org/r/#/c/226639/6 adds a (somewhat awkward) workaround via StatsD request sampling (in MediaWiki). Per IRC discussio... [10:07:30] (03CR) 10Alexandros Kosiaris: [C: 032] torrus: remove support for bits.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/230505 (owner: 10Faidon Liambotis) [10:10:16] (03CR) 10Filippo Giunchedi: logstash: Count MediaWiki log events with statsd (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/230233 (https://phabricator.wikimedia.org/T100735) (owner: 10BryanDavis) [10:11:51] akosiaris: is puppet disabled @ netmon1001 you? [10:17:09] paravoid: yes [10:17:14] ok :) [10:17:18] you will know why in about 10 secs [10:17:27] (03PS1) 10Alexandros Kosiaris: Fix for bug introduced in 811479c [puppet] - 10https://gerrit.wikimedia.org/r/230509 [10:17:31] ^ [10:17:38] should have caught that [10:18:02] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] Fix for bug introduced in 811479c [puppet] - 10https://gerrit.wikimedia.org/r/230509 (owner: 10Alexandros Kosiaris) [10:18:17] oops [10:18:21] I hate torrus btw [10:18:29] have I mentioned that ? :P [10:18:44] so who's using it nowadays? [10:18:49] I'm not, not sure if brandon is [10:18:53] mark :P [10:19:04] I know I am not [10:19:14] i'm not [10:19:32] great! can I kill it yesterday then ? [10:19:39] other than for PDU power graphs perhaps [10:19:40] a bit for that [10:19:46] aaah see ? 
[10:19:49] I can't :-( [10:20:08] although that's in librenms as well these days, isn't it ? [10:20:13] RECOVERY - puppet last run on netmon1001 is OK Puppet is currently enabled, last run 13 seconds ago with 0 failures [10:20:20] not aggregated [10:20:28] which is really the main point missing and what torrus is good for [10:21:23] ok, so how about we at least kill anything but the power strips from torrus [10:21:25] at least for now [10:21:54] and actually update the powerstrips cause all it has is eqiad [10:25:25] (03PS1) 10Matanya: access: stat1002 access for tgr [puppet] - 10https://gerrit.wikimedia.org/r/230510 [10:26:04] (03PS2) 10Giuseppe Lavagetto: ganglia-monitor-aggregator: fix upstart script [puppet] - 10https://gerrit.wikimedia.org/r/228805 [10:26:12] (03CR) 10jenkins-bot: [V: 04-1] access: stat1002 access for tgr [puppet] - 10https://gerrit.wikimedia.org/r/230510 (owner: 10Matanya) [10:29:13] (03PS2) 10Matanya: access: stat1002 access for tgr [puppet] - 10https://gerrit.wikimedia.org/r/230510 [10:29:22] * matanya shakes fist at vi [11:02:46] Dear opsen, pywikipedia.org was a cname to wikimedia-lb.wikimedia.org., which is no more. Should we just replace it by a cname to wikimedia.org? It's handled on the WMF side by ops/apache-config/redirects.* [11:03:23] <_joe_> valhallasw`cloud: what does it point to? [11:03:33] <_joe_> via redirects, I mean [11:03:33] it's a redirect to tools.wmflabs.org/pywikibot [11:03:52] pywikipedia.org is owned by WMNL [11:04:03] <_joe_> ok so... uhm, yes CNAME to wikimedia.org is probably a good bet :) [11:06:14] akosiaris, hi, we have a problem :( pgsql replication no longer works, can't update functions, etc - https://phabricator.wikimedia.org/T108545 [11:07:43] yurik: yes I am aware [11:07:59] akosiaris, on the other side, https://maps.wikimedia.org/static/#7/19.435/-99.146 ;) [11:08:10] its alive!!! 
:) [11:08:31] cool [11:08:31] (updating tiles at the moment, a bit broken in places) [11:13:54] (03PS1) 10Hoo man: Revert "Set dispatchBatchChunkFactor to 10 for now" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/230515 [11:14:42] (03CR) 10Hoo man: [C: 032] Revert "Set dispatchBatchChunkFactor to 10 for now" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/230515 (owner: 10Hoo man) [11:15:06] (03Merged) 10jenkins-bot: Revert "Set dispatchBatchChunkFactor to 10 for now" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/230515 (owner: 10Hoo man) [11:16:04] !log hoo Synchronized php-1.26wmf17/extensions/Wikidata/: Revert "Set dispatchBatchChunkFactor to 10 for now" (duration: 00m 20s) [11:16:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:16:12] doh, wrong path [11:16:46] !log hoo Synchronized wmf-config/: Revert "Set dispatchBatchChunkFactor to 10 for now" (duration: 00m 12s) [11:16:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:21:54] PROBLEM - Check size of conntrack table on xenon is CRITICAL nf_conntrack is 99 % full [11:25:54] RECOVERY - Check size of conntrack table on xenon is OK nf_conntrack is 35 % full [11:29:04] PROBLEM - salt-minion processes on db1010 is CRITICAL: Connection refused by host [11:29:04] PROBLEM - RAID on db1010 is CRITICAL: Connection refused by host [11:31:13] RECOVERY - salt-minion processes on db1010 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [11:31:14] RECOVERY - RAID on db1010 is OK optimal, 1 logical, 2 physical [11:31:31] ^I do not see a reason for this [11:35:55] there is however, something strange with its mysql, will investigate later [11:50:35] (03PS1) 10Alexandros Kosiaris: swap scb100{1,2} MAC addresses in dhcp config [puppet] - 10https://gerrit.wikimedia.org/r/230517 [11:53:03] PROBLEM - puppet last run on mw2048 is CRITICAL puppet fail [12:00:54] PROBLEM - git.wikimedia.org on antimony 
is CRITICAL - Socket timeout after 10 seconds [12:00:57] (03CR) 10Alexandros Kosiaris: [C: 032] swap scb100{1,2} MAC addresses in dhcp config [puppet] - 10https://gerrit.wikimedia.org/r/230517 (owner: 10Alexandros Kosiaris) [12:04:04] (03PS1) 10Alexandros Kosiaris: Fix scb1002's IP address [dns] - 10https://gerrit.wikimedia.org/r/230519 [12:05:09] (03CR) 10Alexandros Kosiaris: [C: 032] Fix scb1002's IP address [dns] - 10https://gerrit.wikimedia.org/r/230519 (owner: 10Alexandros Kosiaris) [12:14:45] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: Access to stat1002 for csteipp - https://phabricator.wikimedia.org/T108227#1523717 (10ArielGlenn) I'll be merging this tomorrow (end of 3 day wait) unless someone shoots it before then [12:19:33] 10Ops-Access-Requests, 6operations, 6Discovery, 10Maps, 3Discovery-Maps-Sprint: Grant log file access to Yurik & Maxsem on maps-test200{1-4} - https://phabricator.wikimedia.org/T106629#1523736 (10ArielGlenn) those files are 640 root (owner) adm (group). so we're really talking about root access here. we... [12:21:43] 10Ops-Access-Requests, 6operations, 6Discovery, 10Maps, 3Discovery-Maps-Sprint: Grant log file access to Yurik & Maxsem on maps-test200{1-4} - https://phabricator.wikimedia.org/T106629#1523742 (10ArielGlenn) noting here the related ticket: https://phabricator.wikimedia.org/T106637 [12:21:43] RECOVERY - puppet last run on mw2048 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [12:26:33] 10Ops-Access-Requests, 6operations, 6Discovery, 10Maps, 3Discovery-Maps-Sprint: Grant sudo on map-tests200* for maps team - https://phabricator.wikimedia.org/T106637#1523770 (10Yurik) @akosiaris, hi, thx for detailed msg. A few updates: * /var/log/postgres/* -- i think it would help with T108545 (Postgr... 
[12:29:55] 10Ops-Access-Requests, 6operations, 6Discovery, 10Maps, 3Discovery-Maps-Sprint: Grant log file access to Yurik & Maxsem on maps-test200{1-4} - https://phabricator.wikimedia.org/T106629#1523786 (10mark) a:3akosiaris Alex is assigned to work on this project, should look at this. [12:35:11] (03PS1) 10ArielGlenn: remove now obselete snapshot hosts sudoers file [puppet] - 10https://gerrit.wikimedia.org/r/230524 [12:36:08] 7Puppet, 6operations: Clean up files/snapshot/sudoers.snapshot - https://phabricator.wikimedia.org/T107479#1523805 (10ArielGlenn) https://gerrit.wikimedia.org/r/#/c/230524/ this file can be tossed. [12:36:20] 7Puppet, 6operations: Clean up files/snapshot/sudoers.snapshot - https://phabricator.wikimedia.org/T107479#1523806 (10ArielGlenn) a:3ArielGlenn [12:38:19] !log deployed nginx-1.9.3-1+wmf2 to cp1065, cp1070, cp1071 (1x each text, upload, misc) for validation [12:38:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:00:44] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: Access to stat1002 for csteipp - https://phabricator.wikimedia.org/T108227#1523852 (10Ottomata) If @csteipp wants access to webrequest logs in Hive, he will need to be in the analytics-privatedata-users group. I just updated [[ https://wikitech.wikimedi... [13:07:19] bblack, hi, around? everything good so far? [13:10:21] everything in the world? [13:11:32] bblack, the world is a mess, but WP & WMF are the islands of peace, tranquility, and perfect order... maybe except for the maps servers ) [13:13:04] 6operations, 10Continuous-Integration-Infrastructure, 6Multimedia, 5Patch-For-Review: Investigate impact of switching from ffmpeg to libav (ffmpeg is not in Jessie) - https://phabricator.wikimedia.org/T103335#1523889 (10fgiunchedi) note that libvpx 1.4 changed SONAME from 1 to 2 though it looks like an ABI... 
[13:27:35] RECOVERY - Kafka Broker Messages In on analytics1021 is OK: kafka.server.BrokerTopicMetrics.AllTopicsMessagesInPerSec.FifteenMinuteRate OKAY: 6935.76635782 [13:28:30] 6operations, 6Discovery, 10Maps, 10Traffic, and 2 others: Set up standard HTTPS Termination -> 2layer caching for maps service - https://phabricator.wikimedia.org/T105076#1523926 (10Ottomata) Yup. I just created the webrequest_maps kafka topic: ``` kafka topic --create --topic webrequest_maps --partition... [13:36:03] PROBLEM - RAID on es2007 is CRITICAL 1 failed LD(s) (Degraded) [13:38:09] (03PS1) 10Alexandros Kosiaris: swap scb100{1,2}.eqiad.wmnet IP addresses and mgmts [dns] - 10https://gerrit.wikimedia.org/r/230536 [13:40:13] 6operations, 10Traffic, 7HTTPS: HTTPS for internal service traffic - https://phabricator.wikimedia.org/T108580#1523942 (10BBlack) 3NEW [13:41:14] (03CR) 10Alexandros Kosiaris: [C: 032] swap scb100{1,2}.eqiad.wmnet IP addresses and mgmts [dns] - 10https://gerrit.wikimedia.org/r/230536 (owner: 10Alexandros Kosiaris) [13:45:48] 6operations, 7HTTPS: Getting ssl_error_inappropriate_fallback_alert very rarely - https://phabricator.wikimedia.org/T108579#1523961 (10Krenair) [13:46:00] 6operations, 10Traffic, 7HTTPS: Getting ssl_error_inappropriate_fallback_alert very rarely - https://phabricator.wikimedia.org/T108579#1523931 (10Krenair) [13:46:50] (03PS1) 10BBlack: Add webrequest_maps kafka topic output for cache_maps [puppet] - 10https://gerrit.wikimedia.org/r/230539 (https://phabricator.wikimedia.org/T105076) [13:47:08] (03PS1) 10BBlack: cache::config: remove unused swift backend def [puppet] - 10https://gerrit.wikimedia.org/r/230540 [13:47:10] (03PS1) 10BBlack: cache::config: replace lvs IP refs with service hostnames [puppet] - 10https://gerrit.wikimedia.org/r/230541 (https://phabricator.wikimedia.org/T108580) [13:48:02] bblack i think you will also need to make the 'role::cache::kafka::webrequest':' class [13:49:15] that class already exists, we use it for the other 
clusters too, just different $topic argument? [13:49:56] 6operations: Photos of Servers - https://phabricator.wikimedia.org/T94694#1523987 (10VictorGrigas) Yes, I've shot the servers, but now there's a legal issue about the rights that I'm waiting to resolve. Needless to say it's frustrating. I'll let you know as soon as they are up. [13:55:21] Oh, bblack, you are right, sorry. [13:55:22] (03PS1) 10Alexandros Kosiaris: Follow up for c7c6c8b [dns] - 10https://gerrit.wikimedia.org/r/230543 [13:56:25] akosiaris: btw, on those fixups, I usually prefix them with what the error was, helps with git log :) [13:56:57] paravoid: is "I messed up" good enough ? [13:57:07] no, I mean [13:57:20] Fix address for scb1002 (fixup for c7c6c8b) [13:57:22] or something like that [13:58:17] oh ok [13:58:50] (03PS2) 10Alexandros Kosiaris: Fix address for scb1002 (fixup for c7c6c8b) [dns] - 10https://gerrit.wikimedia.org/r/230543 [13:58:52] akosiaris: qq for you regarding package names. since i'm building kafka for multiple distros, is the best thing to do to suffix each build with distro, e.g. ~jessie1, ~trusty1, etc.? [13:59:05] and then reprepro add? [13:59:14] ottomata: yes. reprepro limitation unfortunately [13:59:20] ok cool, thought so [14:02:48] (03CR) 10Alexandros Kosiaris: [C: 032] Fix address for scb1002 (fixup for c7c6c8b) [dns] - 10https://gerrit.wikimedia.org/r/230543 (owner: 10Alexandros Kosiaris) [14:02:55] (03PS3) 10Alexandros Kosiaris: Fix address for scb1002 (fixup for c7c6c8b) [dns] - 10https://gerrit.wikimedia.org/r/230543 [14:02:59] (03CR) 10Alexandros Kosiaris: [V: 032] Fix address for scb1002 (fixup for c7c6c8b) [dns] - 10https://gerrit.wikimedia.org/r/230543 (owner: 10Alexandros Kosiaris) [14:09:40] andrewbogott: morning [14:09:51] hi! [14:10:07] matanya: I feel like you had an access request for the meeting today but I can’t find it. Am I misremembering? 
[14:10:26] andrewbogott: i am waiting for input from mark [14:10:39] ok [14:10:41] the file upload restricted access thing? [14:10:42] 6operations, 10Traffic, 7HTTPS: Getting ssl_error_inappropriate_fallback_alert very rarely - https://phabricator.wikimedia.org/T108579#1524034 (10BBlack) @dabpunkt can you provide details on the client software (browser version, OS version, etc?) and any local software that might be interfering ("antivirus"... [14:10:50] yes Krenair [14:11:01] andrewbogott: i spoke with one of the people (I think she is a pm) of openstack at redhat on friday [14:11:19] she said they are thinking of obsoleting nova-network [14:11:33] yes, they’ve been planning to deprecate nova-network for years [14:11:37] I’ll believe it when I see it :) [14:11:53] it is already deprecated [14:12:01] they are now talking about obsoleting [14:12:26] hm, thought it had gone through the full cycle of deprecation back to un-deprecation [14:12:30] But, ok, good to know. [14:12:52] Maybe neutron has been adjusted to finally support our use case… but I doubt it :) [14:13:01] i asked her about it [14:13:14] she said it is planned, but not done yet [14:13:32] * andrewbogott grumbles [14:13:37] it is not in kilo anyway [14:13:39] ‘planned, but not done yet’ is the state that project has been in for years [14:13:46] :) [14:14:02] I don’t understand how they can keep pushing for nova-network to be ripped out because a replacement for it is ‘planned’ but no one is working on it [14:14:22] they actually have a group working on it now [14:14:29] very early stages [14:14:30] I’ll just tell our users “It’s ok that your instances don’t have network connectivity - don’t worry, a solution is planned!” [14:14:54] that would be a great excuse for the next outage! 
:D [14:15:19] <_joe_> andrewbogott: most people having large openstack installations have rewritten their own network stack, I was told by a friend who manages one [14:15:32] that is right [14:16:02] (03CR) 10BBlack: [C: 04-1] "Should block on deployment of https://gerrit.wikimedia.org/r/#/c/230535/ first" [puppet] - 10https://gerrit.wikimedia.org/r/230539 (https://phabricator.wikimedia.org/T105076) (owner: 10BBlack) [14:16:42] matanya: thanks for passing all this on. I’ll try to overcome my grumpiness and ping some people on the mailing list about possible timelines. [14:16:57] Did she mention any theoretical version target for removing nova-network? [14:17:02] M? N? [14:17:11] N [14:17:33] they would prefer M ofc [14:17:49] ok. N still feels like ‘the long run’ to me :) [14:17:59] Since we’re still running I :/ [14:17:59] a year and a half [14:18:03] yeah [14:19:21] (03PS1) 10BBlack: fix description for maps LVS [puppet] - 10https://gerrit.wikimedia.org/r/230546 (https://phabricator.wikimedia.org/T105076) [14:19:22] (03PS1) 10BBlack: remove icinga monitoring for maps.wm.o for now (not production) [puppet] - 10https://gerrit.wikimedia.org/r/230547 (https://phabricator.wikimedia.org/T105076) [14:19:50] andrewbogott: for ref: https://www.linkedin.com/pub/livnat-peer/8/93/9aa [14:20:05] she is the woman I spoke to [14:20:44] (03CR) 10BBlack: [C: 032] fix description for maps LVS [puppet] - 10https://gerrit.wikimedia.org/r/230546 (https://phabricator.wikimedia.org/T105076) (owner: 10BBlack) [14:21:01] (03CR) 10BBlack: [C: 032] remove icinga monitoring for maps.wm.o for now (not production) [puppet] - 10https://gerrit.wikimedia.org/r/230547 (https://phabricator.wikimedia.org/T105076) (owner: 10BBlack) [14:21:33] hm, haven’t met her. 
I need to go to the conf next time it’s in north america [14:21:46] fast turnover [14:22:25] andrewbogott: i was in the annual linux and open source conf in israel last friday, so I met a lot of interesting people :) [14:22:50] lennard pottering gave the keynote talk [14:25:37] (03PS2) 10Alexandros Kosiaris: Apertium: Add -j -m and parameters [puppet] - 10https://gerrit.wikimedia.org/r/230108 (owner: 10KartikMistry) [14:27:54] 6operations, 10Continuous-Integration-Infrastructure, 6Multimedia, 5Patch-For-Review: Investigate impact of switching from ffmpeg to libav (ffmpeg is not in Jessie) - https://phabricator.wikimedia.org/T103335#1524063 (10fgiunchedi) I've uploaded a rebuilt ffmpeg `2.7.2-1~wmf2` and was able to transcode the... [14:27:55] (03PS1) 10Yurik: Add account info to kartotherian config erb file [puppet] - 10https://gerrit.wikimedia.org/r/230549 [14:29:43] (03CR) 10Alexandros Kosiaris: [C: 031] "I had wanted to do this back when I refactored the LVS configuration into hiera but followed the path of least resistance back then. This " [puppet] - 10https://gerrit.wikimedia.org/r/230541 (https://phabricator.wikimedia.org/T108580) (owner: 10BBlack) [14:29:51] !log starting upgrade of existing kafka cluster to 0.8.2.1 jessie - https://etherpad.wikimedia.org/p/kafka_0.8.2.1_migration2 [14:29:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:39:23] (03PS3) 10Alexandros Kosiaris: Apertium: Add -j -m and parameters [puppet] - 10https://gerrit.wikimedia.org/r/230108 (owner: 10KartikMistry) [14:42:23] (03PS4) 10Alexandros Kosiaris: Apertium: Add -j -m and parameters [puppet] - 10https://gerrit.wikimedia.org/r/230108 (owner: 10KartikMistry) [14:42:24] !log restarted all varnishkafka instances to pick up proper confs (puppet should have done this!) 
[14:42:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:42:39] (03CR) 10Giuseppe Lavagetto: [C: 031] cache::config: replace lvs IP refs with service hostnames [puppet] - 10https://gerrit.wikimedia.org/r/230541 (https://phabricator.wikimedia.org/T108580) (owner: 10BBlack) [14:44:15] RECOVERY - RAID on ms-be2009 is OK optimal, 13 logical, 13 physical [14:44:33] (03PS3) 10coren: nrpe: Merge check_systemd_unit_lastrun into _state [puppet] - 10https://gerrit.wikimedia.org/r/228329 [14:44:57] (03CR) 10coren: nrpe: Merge check_systemd_unit_lastrun into _state (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/228329 (owner: 10coren) [14:45:22] (03CR) 10coren: [C: 032] "Carrying Guiseppe's +1 over." [puppet] - 10https://gerrit.wikimedia.org/r/228329 (owner: 10coren) [14:46:36] Coren: outstanding alerts for labstore1002 for 10 days (minus 8 minutes) now [14:46:51] Yeah, I just pushed a patch that should quiet them. [14:47:14] Fault in the test, not the server. [14:47:15] !log running an rsync from nas1001-a to local disks on helium [14:47:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:49:08] (03CR) 10Guillom: [C: 031] Add https://blogdukiwi.wordpress.com/ to the French Planet [puppet] - 10https://gerrit.wikimedia.org/r/230497 (owner: 10Amire80) [14:49:49] 6operations: Decommission virt1001-1009 - https://phabricator.wikimedia.org/T98376#1524085 (10Andrew) 5Open>3Resolved [14:50:44] PROBLEM - puppet last run on cp4014 is CRITICAL Puppet has 1 failures [14:51:05] PROBLEM - puppet last run on cp2014 is CRITICAL Puppet has 1 failures [14:51:15] PROBLEM - puppet last run on cp3018 is CRITICAL Puppet has 1 failures [14:51:29] ? [14:51:34] PROBLEM - puppet last run on cp2003 is CRITICAL Puppet has 1 failures [14:51:49] systemd thing? 
[14:52:23] RECOVERY - RAID on ms-be2003 is OK optimal, 13 logical, 13 physical [14:52:26] (03PS1) 10Alexandros Kosiaris: Split scb100{1,2} in the own role [puppet] - 10https://gerrit.wikimedia.org/r/230552 [14:52:39] ESC[1;31mError: /Stage[main]/Nrpe::Systemd_scripts/File[/usr/local/bin/nrpe_check_systemd_unit_lastrun]: Could not evaluate: Could not retrieve information from environment production source(s) puppet:///modules/ [14:52:43] nrpe/plugins/check_systemd_unit_lastrunESC[0m [14:53:08] I'm guessing it's just the standard "deploy a ref to a new file created in the same commit" race issue [14:53:30] bblack: re-running puppet by hand will tell :) [14:53:39] yeah trying [14:54:01] yeah it's just that, will self-resolve [14:54:28] I saw something that said puppet4 will be ordered in the manifests so that will be interesting [14:54:51] this is true but this is unrelated to this failure [14:54:54] RECOVERY - puppet last run on cp4014 is OK Puppet is currently enabled, last run 55 seconds ago with 0 failures [14:55:42] (03PS2) 10Alexandros Kosiaris: Split scb100{1,2} in their own role [puppet] - 10https://gerrit.wikimedia.org/r/230552 [14:55:50] 6operations, 10ops-codfw: ms-be2009 - RAID degraded / failed disk - https://phabricator.wikimedia.org/T107877#1524115 (10Papaul) @ Dzahn please check and see if you stay have the same error. Thanks. [14:56:07] this is a race between the catalog and the fileserver, not a manifest resource ordering issue [14:56:15] `ah [14:56:39] possibly related to the fact that the fileserver can be a different server in our setup [14:56:45] than the one providing the catalog [14:56:47] (palladium vs. strontium) [14:57:48] chasemp: you are referring to https://tickets.puppetlabs.com/browse/PUP-2253 [14:57:53] Coren: not only those 3 alerts aren't fixed but another 3 appeared... [14:58:10] paravoid: any plan soon to migrate to puppet 4 ? [14:58:14] no [14:58:17] paravoid: Hm. 
The others should go away soon I expect, though I may have to run puppet again. [14:58:21] (03PS1) 10BBlack: tlsproxy: multi_accept off [puppet] - 10https://gerrit.wikimedia.org/r/230553 [14:58:26] no plans and I very much doubt it will be planned soon :) [14:58:27] paravoid: I'm looking at why the others are at issue. [14:59:02] (03CR) 10BBlack: [C: 04-1] "Mostly uploaded this patch so I don't forget about this issue. Needs manual testing first to confirm effects." [puppet] - 10https://gerrit.wikimedia.org/r/230553 (owner: 10BBlack) [14:59:08] sometime in the future works :) [14:59:35] paravoid: Oh right. Missing another commit. Lemme silence tem in the meantime. The other are no longer tests, I expect they will go away next run. I'll force it now. [15:00:04] anomie ostriches thcipriani marktraceur Krenair: Dear anthropoid, the time has come. Please deploy Morning SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150810T1500). [15:00:04] kart_ amire80: A patch you scheduled for Morning SWAT(Max 8 patches) is about to be deployed. Please be available during the process. [15:00:26] I'm around. Who is SWAT'ng. [15:00:37] kart_: I can SWAT [15:01:09] cool. [15:01:41] (03PS1) 10Ottomata: Disable kafka auto create topics until we are ready for eventlogging on Kafka [puppet] - 10https://gerrit.wikimedia.org/r/230554 [15:01:59] (03CR) 10Ottomata: [C: 032 V: 032] Disable kafka auto create topics until we are ready for eventlogging on Kafka [puppet] - 10https://gerrit.wikimedia.org/r/230554 (owner: 10Ottomata) [15:02:08] lots of nrpe alerts errors for service "service" on cpNNNNs [15:02:13] new ones too [15:02:16] hallo [15:02:25] this is definitely not getting better [15:02:27] hi kart_ [15:02:55] Coren ^^^ bblack [15:03:01] oh Unknowns [15:03:20] Wait, what? [15:03:22] I hate UNKNOWN in icinga. it should just be a CRIT [15:03:24] That shouldn't be me. 
[15:03:30] Coren: https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?host=all&type=detail&servicestatustypes=8&hoststatustypes=3&serviceprops=2097162&nostatusheader [15:03:43] it's from the systemd check you just committed [15:03:43] hey i just used salt to restart a lot varnishkafka instances, possibly related? or not. [15:03:45] aharoni: oh. I wrote wrong irc nick in Deployment page :) [15:03:46] ok. [15:03:54] thcipriani: should I merge patch? [15:04:05] kart_: just merged, getting all setup here :) [15:04:05] oh you did. Thanks! [15:04:10] bblack: Ah, fuck. Working on it now. [15:04:38] This was reviewed over and over ffs. [15:04:52] * Coren reverts for now. [15:05:17] (03PS1) 10coren: Revert "nrpe: Merge check_systemd_unit_lastrun into _state" [puppet] - 10https://gerrit.wikimedia.org/r/230555 [15:05:24] bblack: ^^ [15:05:38] * Coren beats his head against the wall. [15:06:05] PROBLEM - apertium apy on scb1001 is CRITICAL: Connection refused [15:06:35] PROBLEM - citoid endpoints health on scb1001 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.64.0.16, port=1970): Max retries exceeded with url: /?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused))) [15:06:45] PROBLEM - apertium apy on scb1002 is CRITICAL: Connection refused [15:06:49] (03CR) 10BBlack: [C: 031] Revert "nrpe: Merge check_systemd_unit_lastrun into _state" [puppet] - 10https://gerrit.wikimedia.org/r/230555 (owner: 10coren) [15:06:55] PROBLEM - cxserver on scb1001 is CRITICAL: Connection refused [15:06:55] PROBLEM - citoid endpoints health on scb1002 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.64.16.21, port=1970): Max retries exceeded with url: /?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused))) [15:07:15] PROBLEM - cxserver on scb1002 is CRITICAL: Connection refused [15:07:15] PROBLEM - graphoid endpoints health on scb1001 is CRITICAL: Generic error: Generic 
connection error: HTTPConnectionPool(host=10.64.0.16, port=19000): Max retries exceeded with url: /?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused))) [15:07:25] PROBLEM - mathoid endpoints health on scb1001 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.64.0.16, port=10042): Max retries exceeded with url: /?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused))) [15:07:27] (03PS2) 10coren: Revert "nrpe: Merge check_systemd_unit_lastrun into _state" [puppet] - 10https://gerrit.wikimedia.org/r/230555 [15:07:32] thcipriani: you? [15:07:45] PROBLEM - puppet last run on scb1001 is CRITICAL Puppet has 41 failures [15:07:45] PROBLEM - graphoid endpoints health on scb1002 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.64.16.21, port=19000): Max retries exceeded with url: /?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused))) [15:07:49] eh? [15:07:54] PROBLEM - mathoid endpoints health on scb1002 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.64.16.21, port=10042): Max retries exceeded with url: /?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused))) [15:08:05] PROBLEM - zotero on scb1001 is CRITICAL: Connection refused [15:08:06] matanya: what me? I'm merging some stuff on tin for SWAT [15:08:13] 6operations, 6Discovery, 10MediaWiki-Search, 7Monitoring: Search service monitoring should fail if search results only return exact matches and suggestions don't work - https://phabricator.wikimedia.org/T101914#1524133 (10chasemp) >>! In T101914#1505329, @Dzahn wrote: > Which monitoring is it about? Icinga... [15:08:14] PROBLEM - puppet last run on scb1002 is CRITICAL Puppet has 41 failures [15:08:19] hello service cluster [15:08:19] (03CR) 10coren: [C: 032] "Until issue is found." 
[puppet] - 10https://gerrit.wikimedia.org/r/230555 (owner: 10coren) [15:08:24] akosiaris: ^^ ? [15:08:35] PROBLEM - zotero on scb1002 is CRITICAL: Connection refused [15:08:46] thcipriani: you didn't deply yet, i guess. the service cluster complains ... [15:09:12] matanya: he is deploying extension, not cxserver (sca) :) [15:09:12] matanya: nope, not just yet, double checking patches still. [15:09:34] * matanya shuts up [15:11:22] !log thcipriani Synchronized php-1.26wmf17/extensions/ContentTranslation/api/ApiContentTranslationPublish.php: SWAT: Enable scrubWikitext=1 in HTML to wikitext conversion using parsoid [[gerrit:230381]] (duration: 00m 13s) [15:11:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:11:32] ^ kart_ check please :) [15:11:39] Stupid. Stupid. Stupid. With everyone saying "use python3" I never considered to actually check that the other hosts might not have had it. [15:12:27] And the script works fine in python 2 anyways. [15:12:28] <_joe_> everyone =? [15:12:28] aharoni: test :) [15:12:38] tsting [15:12:39] testing [15:12:45] _joe_: Mostly Yuvi, but I've had others. My fault entirely. [15:12:54] <_joe_> Coren: never said that to anyone, I also thought it was pretty peculiar to set python3 in the shebang [15:13:10] I've been coding all python 3 for labs for weeks and I just kept going. [15:13:13] My bad. [15:13:19] _joe_: I didn't say you. :-) [15:13:26] so just add an install for the python3 package? [15:13:37] kart_, thcipriani - works, thank you. [15:13:42] That was quick and painless. [15:13:52] <_joe_> Coren: yeah :) [15:13:59] bblack: Is this what we want? I mean, it makes sense to do python 3 to be forward looking but that'll affect pretty much every jessie host. [15:14:01] <_joe_> but what bblack said is the best thing to do [15:14:02] aharoni: awesome—quick and painless is what we aim for—thanks for testing. 
[15:14:05] <_joe_> Coren: yes [15:14:20] Coren: it can't break anything I don't think, it's just a new interpreter that nothing else is using yet [15:14:27] <_joe_> Coren: installing python 3 and all the needed packages though, not just the interpreter [15:14:34] <_joe_> I think we need python3-urllib [15:14:37] <_joe_> probably [15:14:54] _joe_: Isn't python2 the default unless the shebang is explicit? [15:14:58] yes [15:15:04] <_joe_> yes [15:15:45] RECOVERY - puppet last run on cp2003 is OK Puppet is currently enabled, last run 3 seconds ago with 0 failures [15:15:55] RECOVERY - puppet last run on cp3018 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [15:16:22] thcipriani: thanks! [15:16:52] (03CR) 10Tim Landscheidt: "That's what I specifically want to avoid :-). The (IMHO) nice thing about … "specific" configurations vs. flat files is that you can clar" [puppet] - 10https://gerrit.wikimedia.org/r/230376 (https://phabricator.wikimedia.org/T107821) (owner: 10Tim Landscheidt) [15:17:15] RECOVERY - puppet last run on cp2014 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [15:17:30] Hm. Add a require_package in the module like we'd do with libs, or somewhere in base for all jessies? [15:17:56] require_package for exactly what you need for this patch in the right places is better, IMHO [15:18:04] <_joe_> yes [15:18:13] Yeah, that was my first reflex. I love that function. :-) [15:18:19] <_joe_> require_package python3, [15:18:41] <_joe_> require_package 'python3', 'python3-urllib3' probably [15:18:43] (03PS1) 10coren: Redo "nrpe: Merge check_systemd_unit_lastrun into _state" [puppet] - 10https://gerrit.wikimedia.org/r/230556 [15:18:54] <_joe_> Coren: the internals of require_package are scary [15:18:57] 6operations, 10ops-codfw: ms-be2003.codfw.wmnet: slot=1 dev=sdb failed - https://phabricator.wikimedia.org/T108561#1524171 (10Papaul) I will have the drive on site tomorrow.
[15:19:04] <_joe_> ori had to go dig into the bowels of puppet [15:19:08] _joe_: I like the externals. :-) [15:19:23] <_joe_> that also happens with the "role" function btw [15:20:32] (03PS2) 10coren: Redo "nrpe: Merge check_systemd_unit_lastrun into _state" [puppet] - 10https://gerrit.wikimedia.org/r/230556 [15:21:31] _joe_, bblack: ^^ just a re-revert with the require_package added in. [15:27:04] PROBLEM - puppet last run on ms-be2013 is CRITICAL Puppet has 1 failures [15:27:05] PROBLEM - Router interfaces on cr2-codfw is CRITICAL host 208.80.153.193, interfaces up: 102, down: 2, dormant: 0, excluded: 1, unused: 0BRae1: down - Core: asw-a-codfw:ae2BRet-0/0/0: down - asw-a-codfw:et-7/0/52 {#10706} [40Gbps Cu]BR [15:27:29] (03CR) 10coren: [C: 032] "Trivial fix." [puppet] - 10https://gerrit.wikimedia.org/r/230556 (owner: 10coren) [15:27:38] * Coren goes to test quickly. [15:31:01] Yep. That was just it. [15:31:26] * Coren watches icinga like a hawk. A hawk! [15:31:39] "trivial fix" are also another set of famous last words :-) [15:33:44] PROBLEM - puppet last run on cp4018 is CRITICAL Puppet has 1 failures [15:33:54] RECOVERY - Router interfaces on cr2-codfw is OK host 208.80.153.193, interfaces up: 112, down: 0, dormant: 0, excluded: 1, unused: 0 [15:33:58] jynus: It's the tier below "what could go wrong" and "it can't get any worse" :-) [15:34:05] PROBLEM - puppet last run on cp3003 is CRITICAL Puppet has 1 failures [15:34:06] PROBLEM - puppet last run on cp4013 is CRITICAL Puppet has 1 failures [15:34:24] no, there is also commit messages such as "YOLO" [15:34:49] "this is going to fail fatally" [15:34:52] jynus: That one isn't tempting fate, it's poking fate with a cattle prod. :-) [15:35:17] PROBLEM - puppet last run on etcd1002 is CRITICAL Puppet has 1 failures [15:37:34] Ah, good, the unknowns are going away now. 
[15:40:12] 6operations, 7network: cr1/cr2-codfw QSFP+ errors every second for qsfp-0/0/0 - https://phabricator.wikimedia.org/T92616#1524224 (10faidon) 5Open>3Resolved cr1-codfw errors magically disappeared some time ago. However I saw some really strange behavior while trying to reproduce and troubleshoot this **and*... [15:40:41] bblack: BTW, about unknowns... just distaste or there is a more profound reason? To me, 'unknown' means "I can't figure it out and/or I have no datapoint", not the semantics you expect? [15:41:33] Coren: my issue there is we defined that check because we need positive confirmation that nothing's wrong. AFAIK any "UNKNOWN" state is a critical failure. If it can't be monitored, we don't know that it's up. [15:42:02] I just philosophically object to the whole idea that a check that's capable of being in a CRITICAL state can instead fall into an UNKNOWN state that doesn't alert, where now we don't know whether it should be CRITICAL or not. [15:42:25] (03CR) 10Merlijn van Deen: "Sorry, I think my explanation maybe wasn't very clear. What I mean is that it would be nice if a given puppet manifest always gives the sa" [puppet] - 10https://gerrit.wikimedia.org/r/230376 (https://phabricator.wikimedia.org/T107821) (owner: 10Tim Landscheidt) [15:42:45] or another way to think of it "UNKNOWN" is always "CRITICAL" in the sense that it's a critical bug in a check that could return CRITICAL. [15:43:30] bblack: Hm. Well, the original test had the same semantics - it would UNKNOWN if it was badly invoke or, somehow, didn't manage to get the status. [15:43:43] yeah lots of our checks do, including icinga default ones [15:44:00] bblack: Perhaps then our issue should be to alert on unknowns? 
[15:44:10] unknown only makes sense if you're operating on the assumption that everything's ok in the absence of a positive confirmation of error [15:44:25] I prefer to operate on the assumption that everything's broken without a positive confirmation of OK state [15:45:14] RECOVERY - check_disk on backup4001 is OK: DISK OK - free space: / 873901 MB (99% inode=99%): /dev 7991 MB (99% inode=99%): /run 1599 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 8000 MB (100% inode=99%): /archive 868444 MB (30% inode=99%) [15:45:44] "No sysadmin was ever in trouble for having been too paranoid"? It makes sense, but I don't think the right thing is to fold "I can't tell" with "It's broken" but to make sure that "I can't tell" is surfaced at least as loudly [15:46:13] I.e.: UNKNOWNs should ring alarms. [15:46:28] alerting on UNKNOWN is probably the right answer. but if we just flip that on without fixing the common cases first it's going to spamalot [15:46:40] Tru dat. [15:46:40] we currently have lots of checks doing that sporadically by misdesign [15:47:14] Arguably, switching those unknowns to crits would have the same effect regardless. [15:47:42] well UNKNOWN is baked into standard checks anyways, and it would suck to go customize them all just for that [15:49:13] That too. [15:49:35] (03CR) 10BryanDavis: "> I'm assuming the statsd ruby client underneath will send a statsd sample for each line? depending on the volume it might pose problems, " [puppet] - 10https://gerrit.wikimedia.org/r/230233 (https://phabricator.wikimedia.org/T100735) (owner: 10BryanDavis) [15:49:38] Plus the distinction has value: you'd check different things investigating either.
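[editor's note] The "treat UNKNOWN as CRITICAL" position argued above can be sketched as a small wrapper around a Nagios-style plugin. This is only an illustration, not anything deployed or agreed in the channel (the discussion leans toward alerting on UNKNOWN rather than folding it). It assumes the standard plugin exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN); the names escalate and run_check are hypothetical. A check whose interpreter is missing, as happened with the python3 shebang earlier, ends up CRITICAL here because the wrapper cannot execute it.

```python
#!/usr/bin/env python3
"""Sketch: run a Nagios-style check and report anything unclear as CRITICAL."""
import subprocess
import sys

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def escalate(code):
    """Keep OK/WARNING/CRITICAL as they are; map everything else to CRITICAL.

    "Everything else" covers UNKNOWN (3) as well as unexpected codes such as
    127 (command not found) or negative values (killed by a signal).
    """
    return code if code in (OK, WARNING, CRITICAL) else CRITICAL


def run_check(argv):
    """Run the check given as an argument list; return (exit_code, message)."""
    try:
        proc = subprocess.run(
            argv,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            universal_newlines=True,
        )
    except OSError as exc:
        # The check could not be started at all, e.g. a missing interpreter.
        return CRITICAL, "CRITICAL: cannot execute check: {}".format(exc)

    output = proc.stdout.strip()
    code = escalate(proc.returncode)
    if code != proc.returncode:
        # Keep the original exit code visible so the two cases stay distinguishable.
        output = "CRITICAL (plugin exited {}): {}".format(proc.returncode, output)
    return code, output


if __name__ == "__main__":
    if len(sys.argv) < 2:
        print("CRITICAL: no check command given")
        sys.exit(CRITICAL)
    exit_code, message = run_check(sys.argv[1:])
    print(message)
    sys.exit(exit_code)
```

The original exit code is kept in the message on purpose: as noted at the end of the exchange, someone investigating "could not tell" looks at different things than someone investigating "it is broken".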
[15:53:34] RECOVERY - puppet last run on ms-be2013 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [15:59:25] RECOVERY - puppet last run on etcd1002 is OK Puppet is currently enabled, last run 17 seconds ago with 0 failures [16:00:05] RECOVERY - puppet last run on cp4018 is OK Puppet is currently enabled, last run 16 seconds ago with 0 failures [16:00:35] RECOVERY - puppet last run on cp3003 is OK Puppet is currently enabled, last run 57 seconds ago with 0 failures [16:00:35] RECOVERY - puppet last run on cp4013 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [16:02:08] 6operations, 5Patch-For-Review: rotate phab access logs more often on iridium - https://phabricator.wikimedia.org/T108503#1524294 (10Joe) 5Open>3Resolved p:5Triage>3High [16:05:35] (03PS11) 10Giuseppe Lavagetto: puppet-compiler: first commit [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/228849 (https://phabricator.wikimedia.org/T96802) [16:06:05] <_joe_> akosiaris, bblack ^^ just needs two more files and it's ready to use I guess [16:06:15] :) [16:07:28] <_joe_> the code is way cleaner than in the old incarnation, it should be easier to find out what screws up in case [16:08:21] <_joe_> one of the two is creating an html page and adding some css, which knowing myself will probably require me 10 times the effort that the rest required [16:21:59] (03PS3) 10Dzahn: Add Wikimedia Australia blog to the English Planet [puppet] - 10https://gerrit.wikimedia.org/r/230391 (owner: 10Amire80) [16:22:44] (03CR) 10Dzahn: [C: 032] Add Wikimedia Australia blog to the English Planet [puppet] - 10https://gerrit.wikimedia.org/r/230391 (owner: 10Amire80) [16:23:25] (03PS2) 10Dzahn: Add https://blogdukiwi.wordpress.com/ to the French Planet [puppet] - 10https://gerrit.wikimedia.org/r/230497 (owner: 10Amire80) [16:24:22] (03PS3) 10Dzahn: Add https://blogdukiwi.wordpress.com/ to the French Planet [puppet] - 10https://gerrit.wikimedia.org/r/230497 (owner: 
10Amire80) [16:24:33] (03CR) 10Dzahn: [C: 032] Add https://blogdukiwi.wordpress.com/ to the French Planet [puppet] - 10https://gerrit.wikimedia.org/r/230497 (owner: 10Amire80) [16:26:03] (03PS1) 10coren: labstore: add timers for backups [puppet] - 10https://gerrit.wikimedia.org/r/230569 (https://phabricator.wikimedia.org/T106474) [16:40:12] 6operations, 6Labs, 3Labs-Sprint-102, 3Labs-Sprint-103, and 5 others: Reinstall labstore1001 and make sure everything is puppet-ready - https://phabricator.wikimedia.org/T107574#1524393 (10coren) Labstore1001 has been reinstalled (and is pristine from the puppet manifest) and all I/O tests are fine, but th... [16:42:17] 6operations, 6Labs: Investigate whether to use Debian's jessie-backports - https://phabricator.wikimedia.org/T107507#1524403 (10coren) [16:42:20] 6operations, 6Labs, 3Labs-Sprint-102, 3Labs-Sprint-103, and 5 others: Reinstall labstore1001 and make sure everything is puppet-ready - https://phabricator.wikimedia.org/T107574#1498287 (10coren) [16:42:25] (03PS2) 10Faidon Liambotis: Add cr1-eqord and cr1-eqdfw to monitoring tools [puppet] - 10https://gerrit.wikimedia.org/r/230309 [16:42:32] (03CR) 10Faidon Liambotis: [C: 032] Add cr1-eqord and cr1-eqdfw to monitoring tools [puppet] - 10https://gerrit.wikimedia.org/r/230309 (owner: 10Faidon Liambotis) [16:43:31] 6operations, 10vm-requests, 5Patch-For-Review, 7Pybal: codfw: 3 VM request for PyBal - https://phabricator.wikimedia.org/T107901#1524407 (10ori) Thanks very much for the kick-ass setup and the fast turnaround! 
[16:43:55] 6operations, 10ops-codfw: es2007 degraded RAID - disk failure - https://phabricator.wikimedia.org/T108592#1524408 (10jcrespo) 3NEW [16:43:57] 6operations, 6Labs, 3Labs-Sprint-102, 3Labs-Sprint-103, and 5 others: Reinstall labstore1001 and make sure everything is puppet-ready - https://phabricator.wikimedia.org/T107574#1524417 (10coren) [16:44:08] apergos: ping re: https://phabricator.wikimedia.org/T107405 [16:45:13] 6operations: Backport ffmpeg 2.7.3 to Trusty - https://phabricator.wikimedia.org/T107313#1524425 (10ori) @fgiunchedi, any progress on this? [16:46:54] 6operations, 10Datasets-General-or-Unknown, 7HHVM: Package mwbzutils for Trusty - https://phabricator.wikimedia.org/T107405#1524429 (10ori) @ArielGlenn, any progress on this? [16:49:00] ACKNOWLEDGEMENT - RAID on es2007 is CRITICAL 1 failed LD(s) (Degraded) Jcrespo T108592 [16:49:53] (03PS3) 10Alexandros Kosiaris: Split scb100{1,2} in the own role [puppet] - 10https://gerrit.wikimedia.org/r/230552 [16:50:23] !log ori Synchronized php-1.26wmf17/extensions/wikihiero: I2089b21fc: Updated mediawiki/core Project: mediawiki/extensions/wikihiero (duration: 00m 12s) [16:50:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:51:59] 6operations, 6Labs, 3Labs-Sprint-107, 3Labs-Sprint-108, 3ToolLabs-Goals-Q4: Investigate kernel issues on labvirt** hosts - https://phabricator.wikimedia.org/T99738#1524449 (10Andrew) Proposed reboot schedule here: https://wikitech.wikimedia.org/wiki/Virt_node_upgrade_schedule [16:52:16] PROBLEM - Router interfaces on cr1-ulsfo is CRITICAL host 198.35.26.192, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-1/2/0: down - Core: cr1-eqord:xe-0/0/1 Telia (IC-313592) {#?} [10Gbps DWDM]BR [16:52:17] is etherpad down? 
[16:52:26] 6operations, 10MediaWiki-extensions-TimedMediaHandler, 6Multimedia, 7HHVM: Convert tmh100[12] to HHVM and trusty - https://phabricator.wikimedia.org/T104747#1524453 (10fgiunchedi) [16:52:29] 6operations: Backport ffmpeg 2.7.3 to Trusty - https://phabricator.wikimedia.org/T107313#1524451 (10fgiunchedi) 5Open>3Resolved @ori progress indeed but communication breakdown as I was following up on https://phabricator.wikimedia.org/T103335 instead and forgot about this, anyways ffmpeg 2.7.3 is available... [16:52:29] (03PS1) 10Ottomata: Override kafka jmxtrans metrics to test new config for version 0.8.2.1 [puppet] - 10https://gerrit.wikimedia.org/r/230576 (https://phabricator.wikimedia.org/T106581) [16:53:12] (03CR) 10jenkins-bot: [V: 04-1] Override kafka jmxtrans metrics to test new config for version 0.8.2.1 [puppet] - 10https://gerrit.wikimedia.org/r/230576 (https://phabricator.wikimedia.org/T106581) (owner: 10Ottomata) [16:53:14] (03PS4) 10Alexandros Kosiaris: Split scb100{1,2} in their own role [puppet] - 10https://gerrit.wikimedia.org/r/230552 [16:53:20] (03PS5) 10Alexandros Kosiaris: Split scb100{1,2} in their own role [puppet] - 10https://gerrit.wikimedia.org/r/230552 [16:53:54] (03PS2) 10Ottomata: Override kafka jmxtrans metrics to test new config for version 0.8.2.1 [puppet] - 10https://gerrit.wikimedia.org/r/230576 (https://phabricator.wikimedia.org/T106581) [16:53:58] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] Split scb100{1,2} in their own role [puppet] - 10https://gerrit.wikimedia.org/r/230552 (owner: 10Alexandros Kosiaris) [16:54:03] 6operations: Backport ffmpeg 2.7.3 to Trusty - https://phabricator.wikimedia.org/T107313#1524458 (10ori) @fgiunchedi, that's awesome! 
[16:54:26] 6operations: Backport ffmpeg 2.7.3 to Trusty - https://phabricator.wikimedia.org/T107313#1524459 (10ori) [16:54:29] 6operations, 10Continuous-Integration-Infrastructure, 6Multimedia, 5Patch-For-Review: Investigate impact of switching from ffmpeg to libav (ffmpeg is not in Jessie) - https://phabricator.wikimedia.org/T103335#1524461 (10ori) [16:54:33] (03CR) 10jenkins-bot: [V: 04-1] Override kafka jmxtrans metrics to test new config for version 0.8.2.1 [puppet] - 10https://gerrit.wikimedia.org/r/230576 (https://phabricator.wikimedia.org/T106581) (owner: 10Ottomata) [16:55:18] 6operations, 10Continuous-Integration-Infrastructure, 6Multimedia, 5Patch-For-Review: Investigate impact of switching from ffmpeg to libav (ffmpeg is not in Jessie) - https://phabricator.wikimedia.org/T103335#1524463 (10fgiunchedi) [16:55:22] 6operations, 10MediaWiki-extensions-TimedMediaHandler, 6Multimedia, 7HHVM: Convert tmh100[12] to HHVM and trusty - https://phabricator.wikimedia.org/T104747#1524462 (10fgiunchedi) [16:55:28] (03PS3) 10Ottomata: Override kafka jmxtrans metrics to test new config for version 0.8.2.1 [puppet] - 10https://gerrit.wikimedia.org/r/230576 (https://phabricator.wikimedia.org/T106581) [16:56:05] PROBLEM - etherpad.wikimedia.org HTTP on etherpad1001 is CRITICAL - Socket timeout after 10 seconds [16:56:25] RECOVERY - puppet last run on scb1002 is OK Puppet is currently enabled, last run 37 seconds ago with 0 failures [16:56:26] 6operations, 3Discovery-Maps-Sprint: Postgres replication is not working - https://phabricator.wikimedia.org/T108545#1524464 (10MaxSem) This is apparently the cause of slaves not having spatial indexes: maps-test2001: ``` gis=> \d+ planet_osm_polygon; . . . Indexes: "planet_osm_polygon_index" gist (way)... 
[16:56:28] (03PS4) 10Ottomata: Override kafka jmxtrans metrics to test new config for version 0.8.2.1 [puppet] - 10https://gerrit.wikimedia.org/r/230576 (https://phabricator.wikimedia.org/T106581) [16:56:33] (03CR) 10Ottomata: [C: 032 V: 032] Override kafka jmxtrans metrics to test new config for version 0.8.2.1 [puppet] - 10https://gerrit.wikimedia.org/r/230576 (https://phabricator.wikimedia.org/T106581) (owner: 10Ottomata) [16:58:04] RECOVERY - etherpad.wikimedia.org HTTP on etherpad1001 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 522 bytes in 0.003 second response time [16:58:38] !log ori Synchronized php-1.26wmf17/resources/Resources.php: I2089b21fc: Load mediawiki.legacy.commonPrint styles with a media type property (1/2) (duration: 00m 11s) [16:58:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:59:02] !log ori Synchronized php-1.26wmf17/includes/OutputPage.php: I2089b21fc: Load mediawiki.legacy.commonPrint styles with a media type property (2/2) (duration: 00m 11s) [16:59:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:00:16] RECOVERY - puppet last run on scb1001 is OK Puppet is currently enabled, last run 11 seconds ago with 0 failures [17:01:33] 7Blocked-on-Operations, 6operations, 10Parsoid: Disabling agent forwarding breaks dsh based restarts for Parsoid (required for deployments) - https://phabricator.wikimedia.org/T102039#1524498 (10ssastry) *bump* Any progress on this? We continue to use the shell script loop .. but on occasion, because of tran... [17:05:35] (03CR) 10Merlijn van Deen: "What's the typical runtime of the backups? If they can take more than an hour, the staggering might not be enough." 
[puppet] - 10https://gerrit.wikimedia.org/r/230569 (https://phabricator.wikimedia.org/T106474) (owner: 10coren) [17:06:29] 6operations, 3Discovery-Maps-Sprint: Postgres replication is not working - https://phabricator.wikimedia.org/T108545#1524515 (10Yurik) All queries for tile generation are hitting 2001 only for now. [17:06:48] (03CR) 10Merlijn van Deen: [C: 031] haproxy: Move check_haproxy to module itself [puppet] - 10https://gerrit.wikimedia.org/r/228712 (https://phabricator.wikimedia.org/T87132) (owner: 10Tim Landscheidt) [17:08:36] (03CR) 10coren: "The duration varies from time to time, and from filesystem to filesystem. From as low as 20 mins to as many as 12-14h." [puppet] - 10https://gerrit.wikimedia.org/r/230569 (https://phabricator.wikimedia.org/T106474) (owner: 10coren) [17:11:48] (03PS1) 10Ottomata: Better alias name for All topic metrics from kafka [puppet] - 10https://gerrit.wikimedia.org/r/230577 (https://phabricator.wikimedia.org/T106581) [17:12:55] !log stopped postgres on maps-test200{2,3,4} [17:12:59] (03CR) 10Ottomata: [C: 032] Better alias name for All topic metrics from kafka [puppet] - 10https://gerrit.wikimedia.org/r/230577 (https://phabricator.wikimedia.org/T106581) (owner: 10Ottomata) [17:13:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:15:48] (03CR) 10Merlijn van Deen: [C: 031] "OK. The patch looks good to me." 
[puppet] - 10https://gerrit.wikimedia.org/r/230569 (https://phabricator.wikimedia.org/T106474) (owner: 10coren) [17:19:40] akosiaris, could you +2 https://gerrit.wikimedia.org/r/#/c/230549/ [17:20:11] YuviPanda, btw, i think ^ is a perfect SWAT candidate :) [17:21:02] Possibly :) [17:21:10] 6operations: Stop a poolcounter server fail from being a SPOF for the service and the api (and the site) - https://phabricator.wikimedia.org/T105378#1524570 (10Dzahn) there was an API outage due to poolcounter server dropping packages due to an issue with ferm rules: https://wikitech.wikimedia.org/wiki/Incident... [17:21:14] Puppetswat starts only next wed tho [17:22:10] (03CR) 10Alexandros Kosiaris: [C: 04-1] Add account info to kartotherian config erb file (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/230549 (owner: 10Yurik) [17:23:41] akosiaris, when you create it, i might have to update the kartotherian service patch - https://gerrit.wikimedia.org/r/#/c/229727/ [17:26:52] akosiaris, should i update the patch now? [17:27:08] yurik: yup [17:27:32] PROBLEM - puppet last run on netmon1001 is CRITICAL puppet fail [17:27:51] oh netmon, what now... [17:29:51] RECOVERY - puppet last run on netmon1001 is OK Puppet is currently enabled, last run 40 seconds ago with 0 failures [17:30:21] akosiaris, what name will you use in hierra for the pgsql password? [17:30:28] postgresql::master::kartotherian_pass [17:30:29] ? [17:30:34] and what about cassandra [17:31:16] kartotherian_user ? [17:31:27] and kartotherian_pass respectively ? [17:31:31] 6operations, 10Traffic, 10Wikimedia-DNS, 5Patch-For-Review: DNS request for wikimedia.org (let 3rd party send mail as wikimedia.org) - https://phabricator.wikimedia.org/T107940#1524616 (10CaitVirtue) Hey everyone--Really sorry to do this, but Jimmy Wales had a scheduling conflict with the October 1st date... 
[17:34:47] PROBLEM - Router interfaces on cr1-eqord is CRITICAL host 208.80.154.198, interfaces up: 33, down: 3, dormant: 0, excluded: 0, unused: 0BRxe-1/1/0: down - Peering: ! Equinix Chicago (SR 17915277) {#?} [10Gbps DF]BRem1: down - BRxe-0/0/1: down - Core: cr1-ulsfo:xe-1/2/0 Telia (IC-313592) {#?} [10Gbps DWDM]BR [17:37:18] PROBLEM - Router interfaces on cr1-eqdfw is CRITICAL host 208.80.153.198, interfaces up: 33, down: 2, dormant: 0, excluded: 0, unused: 0BRxe-1/1/0: down - Peering: ! Equinix Dallas (SR 17915024) {#?} [10Gbps DF]BRem1: down - BR [17:40:35] 6operations, 10Continuous-Integration-Infrastructure, 6Multimedia, 5Patch-For-Review: Investigate impact of switching from ffmpeg to libav (ffmpeg is not in Jessie) - https://phabricator.wikimedia.org/T103335#1524661 (10brion) Ok, this gets VP9 transcodes with ffmpeg working! But ffmpeg2theora is now faili... [17:42:05] 6operations, 3Discovery-Maps-Sprint: Postgres replication is not working - https://phabricator.wikimedia.org/T108545#1524663 (10akosiaris) stop postgres on maps-test200{2,3,4} and resyncing right now [17:42:50] bblack: should I file a separate blocking bug for usage of w.wiki? 
[17:42:56] 6operations, 3Labs-Sprint-104, 3Labs-Sprint-105, 3Labs-Sprint-107, and 2 others: Setup/Install/Deploy labnet1002 - https://phabricator.wikimedia.org/T99701#1524666 (10Andrew) [17:45:16] (03CR) 10Alexandros Kosiaris: [C: 04-1] "minor comments, mostly looks ok" (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/229727 (https://phabricator.wikimedia.org/T105074) (owner: 10Yurik) [17:45:18] (03PS2) 10Yurik: Add account info to kartotherian config erb file [puppet] - 10https://gerrit.wikimedia.org/r/230549 [17:45:24] bblack, ^ [17:45:51] (03PS5) 10Alexandros Kosiaris: Apertium: Add -j -m and parameters [puppet] - 10https://gerrit.wikimedia.org/r/230108 (owner: 10KartikMistry) [17:45:57] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] Apertium: Add -j -m and parameters [puppet] - 10https://gerrit.wikimedia.org/r/230108 (owner: 10KartikMistry) [17:46:21] (03PS1) 10Eevans: update collector version [software/cassandra-metrics-collector] - 10https://gerrit.wikimedia.org/r/230582 (https://phabricator.wikimedia.org/T101764) [17:46:22] 6operations, 7network: Set up cr1-eqord & cr1-eqdfw - https://phabricator.wikimedia.org/T89227#1524684 (10faidon) [17:46:25] 6operations, 10ops-codfw: Please rack & connect the Tampa MX80s in row D - https://phabricator.wikimedia.org/T84658#1524682 (10faidon) 5Open>3Resolved These are now both set up and connected to our network. [17:46:47] 6operations, 10ops-codfw: EQDFW/EQORD Deployment Prep Task - https://phabricator.wikimedia.org/T91077#1524693 (10faidon) [17:46:49] 6operations, 7network: Set up cr1-eqord & cr1-eqdfw - https://phabricator.wikimedia.org/T89227#1524688 (10faidon) 5Open>3Resolved a:3faidon These are now connected and set up (still @ codfw, though). 
[17:46:58] 6operations, 10Continuous-Integration-Infrastructure, 6Multimedia, 5Patch-For-Review: Investigate impact of switching from ffmpeg to libav (ffmpeg is not in Jessie) - https://phabricator.wikimedia.org/T103335#1524694 (10brion) Sample command line for VP9->ogv conversion with ffmpeg2theora: source file: ht... [17:47:52] 6operations, 10ops-codfw: EQDFW/EQORD Deployment Prep Task - https://phabricator.wikimedia.org/T91077#1073784 (10faidon) [17:48:48] faidon@carbon:~$ sudo puppet agent -vt [17:48:48] Notice: Skipping run of Puppet configuration client; administratively disabled (Reason: 'reason not specified'); [17:48:53] 6 hours ago [17:49:00] please always specify a reason with your name :) [17:49:03] 6operations, 10ContentTranslation-cxserver, 6Language-Engineering, 10MediaWiki-extensions-ContentTranslation, and 2 others: Apertium leaves a ton of stale processes, consumes all the available - https://phabricator.wikimedia.org/T107270#1524704 (10KartikMistry) https://gerrit.wikimedia.org/r/#/c/230108/ is... [17:49:04] akosiaris or godog maybe? [17:49:59] (03PS1) 10Alexandros Kosiaris: apertium: Add respawn to upstart config file [puppet] - 10https://gerrit.wikimedia.org/r/230583 [17:50:20] yeah, that was me [17:50:22] enabled [17:50:38] and I never really needed to disable it, which is why I never gave I reason [17:50:40] akosiaris: thanks for fixing my patch :) [17:50:56] (03CR) 10Yurik: Added tilerator service, and granted kartotherian OSM DB read access (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/229727 (https://phabricator.wikimedia.org/T105074) (owner: 10Yurik) [17:51:00] (03PS4) 10Yurik: Added tilerator service, and granted kartotherian OSM DB read access [puppet] - 10https://gerrit.wikimedia.org/r/229727 (https://phabricator.wikimedia.org/T105074) [17:51:03] akosiaris, ^ [17:51:11] akosiaris: There are more fixes in upstream, will update package after updating in Debian. [17:51:29] akosiaris, i fixed both patches. 
bblack - sorry, for pinging you earlier - i meant akosiaris [17:51:41] (03CR) 10jenkins-bot: [V: 04-1] Added tilerator service, and granted kartotherian OSM DB read access [puppet] - 10https://gerrit.wikimedia.org/r/229727 (https://phabricator.wikimedia.org/T105074) (owner: 10Yurik) [17:51:43] (03CR) 10Alexandros Kosiaris: [C: 032] apertium: Add respawn to upstart config file [puppet] - 10https://gerrit.wikimedia.org/r/230583 (owner: 10Alexandros Kosiaris) [17:56:15] 6operations, 10ContentTranslation-cxserver, 6Language-Engineering, 10MediaWiki-extensions-ContentTranslation, and 2 others: Apertium leaves a ton of stale processes, consumes all the available - https://phabricator.wikimedia.org/T107270#1524740 (10akosiaris) https://gerrit.wikimedia.org/r/#/c/230108/ has b... [17:56:36] (03CR) 10Mobrovac: [C: 031] update collector version [software/cassandra-metrics-collector] - 10https://gerrit.wikimedia.org/r/230582 (https://phabricator.wikimedia.org/T101764) (owner: 10Eevans) [17:57:23] (03PS4) 10BryanDavis: logstash: Count MediaWiki log events with statsd [puppet] - 10https://gerrit.wikimedia.org/r/230233 (https://phabricator.wikimedia.org/T100735) [17:57:25] (03PS3) 10BryanDavis: logstash: Enable doc_values in template mapping [puppet] - 10https://gerrit.wikimedia.org/r/230250 (https://phabricator.wikimedia.org/T74930) [17:57:27] (03PS26) 10BryanDavis: labs: new role::logstash::stashbot class [puppet] - 10https://gerrit.wikimedia.org/r/227175 [17:57:29] (03PS17) 10BryanDavis: Update configuration for logstash 1.5.3 [puppet] - 10https://gerrit.wikimedia.org/r/226991 (https://phabricator.wikimedia.org/T99735) [17:58:21] 6operations, 10ContentTranslation-cxserver, 6Language-Engineering, 10MediaWiki-extensions-ContentTranslation, and 2 others: Apertium leaves a ton of stale processes, consumes all the available - https://phabricator.wikimedia.org/T107270#1524746 (10akosiaris) > ``` > #!/bin/bash > time for i in {1..10}; do... 
[17:59:44] (03CR) 10BryanDavis: [C: 04-1] "Hold until scheduled deploy window of 2015-08-10T16:00Z" [puppet] - 10https://gerrit.wikimedia.org/r/226991 (https://phabricator.wikimedia.org/T99735) (owner: 10BryanDavis) [18:00:09] 6operations, 10Traffic, 7HTTPS, 5Patch-For-Review: HTTPS for internal service traffic - https://phabricator.wikimedia.org/T108580#1524753 (10BBlack) p:5Triage>3Normal [18:05:50] 6operations, 6Services, 10hardware-requests, 5Patch-For-Review: Assign WMF5842, WMF5843 for service cluster expansion as scb1001, scb1002 - https://phabricator.wikimedia.org/T107287#1524782 (10akosiaris) Cluster has been installed and is ready to start accepting services [18:08:14] 7Blocked-on-Operations, 6operations, 6Services: Migrate SCA cluster to Jessie - https://phabricator.wikimedia.org/T96017#1524797 (10akosiaris) [18:08:17] 6operations, 6Mobile-Apps, 6Services, 3Mobile-Content-Service, 5Patch-For-Review: Deployment of Mobile App's service on the SCA cluster - https://phabricator.wikimedia.org/T92627#1524798 (10akosiaris) [18:08:20] 6operations, 6Services, 10hardware-requests, 5Patch-For-Review: Assign WMF5842, WMF5843 for service cluster expansion as scb1001, scb1002 - https://phabricator.wikimedia.org/T107287#1524795 (10akosiaris) 5Open>3Resolved Resolving this [18:10:35] (03PS1) 10Eevans: update cassandra-metrics-collector version [puppet] - 10https://gerrit.wikimedia.org/r/230589 (https://phabricator.wikimedia.org/T101764) [18:13:12] 6operations, 6Phabricator: Disk space monitoring for Phabricator - https://phabricator.wikimedia.org/T108608#1524825 (10Mattflaschen) 3NEW [18:13:37] 6operations, 6Phabricator, 7Monitoring: Disk space monitoring for Phabricator - https://phabricator.wikimedia.org/T108608#1524832 (10Krenair) [18:19:00] 7Blocked-on-Operations, 6Collaboration-Team-Backlog, 10Collaboration-Team-Current, 10Flow, and 2 others: Separate reference tables by wiki - https://phabricator.wikimedia.org/T107204#1524850 (10Mattflaschen) [18:24:46] jynus, we 
would appreciate it if you or springle could take a look at https://gerrit.wikimedia.org/r/#/c/136280/ when you're able. It was previously reviewed, so it's not a whole new thing, but there has been some small rebasing/adjustments. [18:24:52] jynus, until this is resolved, we may have issues in production, like https://phabricator.wikimedia.org/T67802#1497491 [18:26:02] 6operations, 6Phabricator, 7Monitoring: Disk space monitoring for Phabricator - https://phabricator.wikimedia.org/T108608#1524863 (10chasemp) p:5Triage>3Normal [18:27:47] matt_flaschen, will have a look at it [18:28:35] 6operations, 3Discovery-Maps-Sprint: Add user/passwords info for the production configuration file - https://phabricator.wikimedia.org/T108610#1524867 (10Yurik) 3NEW a:3Yurik [18:29:07] jynus, thanks. I appreciate it. Note springle already applied an earlier version of phase 1 to production. [18:29:18] (03PS3) 10Yurik: Add usernames+passwords to kartotherian config erb file [puppet] - 10https://gerrit.wikimedia.org/r/230549 (https://phabricator.wikimedia.org/T108610) [18:29:33] will have to investigate [18:30:37] (03CR) 10Mobrovac: [C: 031] "lgtm, once the other patch has been merged / deployed." 
[puppet] - 10https://gerrit.wikimedia.org/r/230589 (https://phabricator.wikimedia.org/T101764) (owner: 10Eevans) [18:38:19] 6operations, 10Traffic: Fix Varnish TTLs across the board - https://phabricator.wikimedia.org/T108612#1524905 (10BBlack) 3NEW [18:38:56] (03PS5) 10Yurik: Added tilerator service, and granted kartotherian OSM DB read access [puppet] - 10https://gerrit.wikimedia.org/r/229727 (https://phabricator.wikimedia.org/T105074) [18:39:38] (03CR) 10jenkins-bot: [V: 04-1] Added tilerator service, and granted kartotherian OSM DB read access [puppet] - 10https://gerrit.wikimedia.org/r/229727 (https://phabricator.wikimedia.org/T105074) (owner: 10Yurik) [18:41:48] 6operations, 10RESTBase, 10RESTBase-Cassandra: Set up multi-DC replication for Cassandra - https://phabricator.wikimedia.org/T108613#1524920 (10GWicke) 3NEW [18:42:31] 6operations, 10RESTBase, 10RESTBase-Cassandra: Set up multi-DC replication for Cassandra - https://phabricator.wikimedia.org/T108613#1524929 (10GWicke) [18:44:58] (03CR) 10Yurik: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/229727 (https://phabricator.wikimedia.org/T105074) (owner: 10Yurik) [18:53:59] 6operations, 6Services, 10hardware-requests, 5Patch-For-Review: Assign WMF5842, WMF5843 for service cluster expansion as scb1001, scb1002 - https://phabricator.wikimedia.org/T107287#1524975 (10akosiaris) For posterity and clarity's sake, I just briefed ops in meeting about the hardware, got a implicit OK w... [18:55:52] 6operations, 10RESTBase, 10RESTBase-Cassandra: Test multiple Cassandra instances per hardware node - https://phabricator.wikimedia.org/T95253#1524978 (10GWicke) We talked about this at the last hardware planning meeting. The consensus was to keep things simple for now, and set up additional IPs & bind Cassan... [18:56:18] legoktm: your mail is kmehta@wikimedia.org ? 
[18:56:37] matanya: yes, but I typically use legoktm@wm.o [18:57:14] legoktm: the guy that last his user page asked me to ask you permission to share your email, he wants to send you something [18:57:24] ok [18:57:29] cool, thanks [18:57:29] it should be public somewhere [18:57:44] not on wm.org or mw.org [18:58:51] 10Ops-Access-Requests, 6operations, 6Services, 7Icinga, 7Monitoring: give services team permissions to send commands in icinga - https://phabricator.wikimedia.org/T105228#1524989 (10RobH) NOTE: Mark has approved giving the services team (as individual logins) access to ack/suspend/control icinga monitori... [18:59:06] (03PS3) 10Dzahn: admin: add johnflewis to mailman-admins [puppet] - 10https://gerrit.wikimedia.org/r/230134 (https://phabricator.wikimedia.org/T108082) (owner: 10John F. Lewis) [18:59:33] matanya: https://meta.wikimedia.org/wiki/User:Legoktm_%28WMF%29 :P [19:00:09] (03CR) 10Dzahn: [C: 032] "approved in ops meeting" [puppet] - 10https://gerrit.wikimedia.org/r/230134 (https://phabricator.wikimedia.org/T108082) (owner: 10John F. Lewis) [19:00:22] 10Ops-Access-Requests, 6operations, 6Services, 7Icinga, 7Monitoring: give services team permissions to send commands in icinga - https://phabricator.wikimedia.org/T105228#1525006 (10RobH) a:3RobH This is changing an access file for icinga, so I'd like to do it (stealing task) @Gwicke: Can you go ahead... 
[19:00:31] thanks legoktm [19:00:32] 6operations, 10RESTBase, 10RESTBase-Cassandra: Test multiple Cassandra instances per hardware node - https://phabricator.wikimedia.org/T95253#1525011 (10GWicke) p:5Normal>3High [19:01:20] 10Ops-Access-Requests, 6operations, 6Discovery, 10Maps, 3Discovery-Maps-Sprint: Grant log file access to Yurik & Maxsem on maps-test200{1-4} - https://phabricator.wikimedia.org/T106629#1525015 (10akosiaris) [19:01:22] 10Ops-Access-Requests, 6operations, 6Discovery, 10Maps, 3Discovery-Maps-Sprint: Grant sudo on map-tests200* for maps team - https://phabricator.wikimedia.org/T106637#1525016 (10akosiaris) [19:03:09] (03PS6) 10Yurik: Added tilerator service, and granted kartotherian OSM DB read access [puppet] - 10https://gerrit.wikimedia.org/r/229727 (https://phabricator.wikimedia.org/T105074) [19:03:28] 10Ops-Access-Requests, 6operations, 6Services, 7Icinga, 7Monitoring: give services team permissions to send commands in icinga - https://phabricator.wikimedia.org/T105228#1525019 (10GWicke) @robh: currently it's primarily eevans, mobrovac & gwicke (myself). There is also Petr (ppchelko), but he isn't doi... [19:03:54] (03CR) 10jenkins-bot: [V: 04-1] Added tilerator service, and granted kartotherian OSM DB read access [puppet] - 10https://gerrit.wikimedia.org/r/229727 (https://phabricator.wikimedia.org/T105074) (owner: 10Yurik) [19:05:00] (03CR) 10Merlijn van Deen: "Currently projects have gid=50062 (bastion) to 52653 (nonfreewiki), so somewhere in the 60000-range is probably OK for now. 
The only refer" [puppet] - 10https://gerrit.wikimedia.org/r/230477 (https://phabricator.wikimedia.org/T95747) (owner: 10Tim Landscheidt) [19:05:16] (03PS7) 10Yurik: Added tilerator service, and granted kartotherian OSM DB read access [puppet] - 10https://gerrit.wikimedia.org/r/229727 (https://phabricator.wikimedia.org/T105074) [19:22:46] !log ori Synchronized php-1.26wmf17/includes: I9a1aa76de: Moved ObjectCacheSessionHandler renewal logic to wfSetupSession() (duration: 00m 16s) [19:22:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:29:41] (03CR) 10Gergő Tisza: [C: 031] access: stat1002 access for tgr [puppet] - 10https://gerrit.wikimedia.org/r/230510 (owner: 10Matanya) [19:29:53] 6operations, 10Deployment-Systems, 10RESTBase, 6Services, 5Patch-For-Review: [Discussion] Move restbase config to Ansible (or $deploy_system in general)? - https://phabricator.wikimedia.org/T107532#1525074 (10GWicke) So, to summarize the discussion: - ops have not clearly stated whether they prefer serv... [19:33:05] 6operations, 10RESTBase, 10RESTBase-Cassandra: Set up multi-DC replication for Cassandra - https://phabricator.wikimedia.org/T108613#1525091 (10GWicke) [19:34:09] 6operations, 10RESTBase, 10RESTBase-Cassandra: Set up multi-DC replication for Cassandra - https://phabricator.wikimedia.org/T108613#1524920 (10GWicke) [19:35:45] (03CR) 10Merlijn van Deen: [C: 031] "Code looks good to me. Unless you want to change to the gid=49000-49999 range, this can be merged." 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/230477 (https://phabricator.wikimedia.org/T95747) (owner: 10Tim Landscheidt) [19:38:42] matt_flaschen, flow sql patches are not trivial, and I have to check what has been done already [19:39:20] the golden rule is: if an update has more than 1000 rows, separate it in batches [19:39:52] if an alter has more than 1000-10000 rows, let us do it [19:40:13] will check it more thoroughly tomorrow [19:41:05] or ask Sean in a few hours if he is already in the known [19:50:49] (03PS1) 10Chad: MW releases: create shared build directory in /srv [puppet] - 10https://gerrit.wikimedia.org/r/230601 [19:51:05] Anyone got a quick second to poke ^, should be trivial and will make me + csteipp's lives easier. [19:51:59] (03CR) 10jenkins-bot: [V: 04-1] MW releases: create shared build directory in /srv [puppet] - 10https://gerrit.wikimedia.org/r/230601 (owner: 10Chad) [19:57:36] !log synced new kartotherian [19:57:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:00:05] gwicke cscott arlolra subbu: Dear anthropoid, the time has come. Please deploy Services – Parsoid / OCG / Citoid / … (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150810T2000). [20:01:08] ok, starting parsoid deploy [20:03:51] (03Abandoned) 10Hashar: Use libav instead of ffmpeg on Jessie [puppet] - 10https://gerrit.wikimedia.org/r/222999 (https://phabricator.wikimedia.org/T95002) (owner: 10Hashar) [20:08:37] jynus, thank you for looking at it. It sounds like you might be saying we should do https://gerrit.wikimedia.org/r/#/c/136280/15/db_patches/patch-reference_wiki-phase2.sql in batches. [20:09:03] 6operations, 6Release-Engineering: [Spike] Try out hack ( matt_flaschen, you have access to the data, SELECT count(*) the number of rows affected on each wiki [20:09:53] jynus, will do. [20:10:06] It is probably more than 1000 total, though. 
[20:10:07] and if it is too large in some cases, put a LIMIT to UPDATES [20:10:41] assuming it will be executed only once per row [20:11:04] the alters there sould be trivial, but I would have to confirm them [20:11:07] I think we need to add a where in that case, but not big deal. [20:11:22] yes, you get the idea [20:11:49] will add more information tomorrow, it is getting too late here [20:11:59] !log deployed parsoid version 7b554ce2f [20:12:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:13:12] 6operations, 5Continuous-Integration-Isolation: install/deploy labnodepool1001 - https://phabricator.wikimedia.org/T95045#1525180 (10hashar) [20:13:15] 6operations, 5Continuous-Integration-Isolation, 5Patch-For-Review: Remove hashar and dduvall root access on to be installed labnodepool1001 - https://phabricator.wikimedia.org/T95303#1525178 (10hashar) 5stalled>3Resolved Confirmed. Thank you for the cleanup! [20:20:43] (03PS1) 10coren: Change test for log_type to a list [software] - 10https://gerrit.wikimedia.org/r/230645 [20:20:51] jynus: ^^ [20:22:59] 6operations, 10ops-codfw: EQDFW/EQORD Deployment Prep Task - https://phabricator.wikimedia.org/T91077#1525310 (10RobH) [20:25:03] 6operations, 6Phabricator, 7Monitoring: Disk space monitoring for Phabricator - https://phabricator.wikimedia.org/T108608#1525327 (10chasemp) 5Open>3Resolved a:3chasemp This already exists now inherited from `base` and looks like > check_disk -w 6% -c 3% -l -e -A -i "/srv/sd[a-b][1-3]" > DISK OK| /=... [20:26:52] 6operations, 10ops-codfw: EQDFW/EQORD Deployment Prep Task - https://phabricator.wikimedia.org/T91077#1525340 (10RobH) [20:31:01] 6operations, 6Phabricator, 7Monitoring: Disk space monitoring for Phabricator - https://phabricator.wikimedia.org/T108608#1525380 (10Dzahn) we already have this because check_disk is a standard on all servers includin the "standard" class. since phabricator is running on iridium the link is: https://icinga... 
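Note on the batching rule jynus describes above (split any UPDATE touching more than 1000 rows into batches, use a LIMIT, and add a WHERE so each row is only processed once): a minimal sketch of that pattern, not the actual Flow patch. The table `demo_ref`, the column `ref_wiki`, and the function names are hypothetical, and SQLite stands in for the production database purely so the example is self-contained. Only the 1000-row threshold comes from the conversation.

```python
import sqlite3

BATCH_SIZE = 1000  # threshold quoted in the log; all other names are hypothetical


def backfill_in_batches(conn, wiki, batch_size=BATCH_SIZE):
    """Fill ref_wiki on rows that lack it, at most batch_size rows per transaction."""
    total = 0
    while True:
        with conn:  # each batch is committed as its own transaction
            cur = conn.execute(
                "UPDATE demo_ref SET ref_wiki = ? "
                "WHERE rowid IN (SELECT rowid FROM demo_ref "
                "WHERE ref_wiki IS NULL LIMIT ?)",
                (wiki, batch_size),
            )
        # The WHERE clause skips rows that are already filled in,
        # so an empty batch means the backfill is complete.
        if cur.rowcount == 0:
            return total
        total += cur.rowcount


def demo():
    """Build a throwaway table with 2500 rows and backfill it in batches."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE demo_ref (id INTEGER PRIMARY KEY, ref_wiki TEXT)")
    conn.executemany("INSERT INTO demo_ref (id) VALUES (?)", [(i,) for i in range(2500)])
    conn.commit()
    updated = backfill_in_batches(conn, "examplewiki")
    remaining = conn.execute(
        "SELECT count(*) FROM demo_ref WHERE ref_wiki IS NULL"
    ).fetchone()[0]
    return updated, remaining
```

Counting the affected rows first, as suggested at 20:09:03, shows whether batching is needed at all. The subquery form is used here because a plain `UPDATE ... LIMIT` is not available in every SQL engine or build.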
[20:39:10] (03CR) 10Dzahn: "Alex, does it look ok? Since i amended to Moritz' changeset here it would be self-review" [puppet] - 10https://gerrit.wikimedia.org/r/227960 (https://phabricator.wikimedia.org/T104964) (owner: 10Muehlenhoff) [20:40:01] (03CR) 10Dzahn: "i added the cluster hostnames in https://gerrit.wikimedia.org/r/#/c/229203/" [puppet] - 10https://gerrit.wikimedia.org/r/227960 (https://phabricator.wikimedia.org/T104964) (owner: 10Muehlenhoff) [20:43:10] 6operations, 10ops-codfw: EQDFW/EQORD Deployment Prep Task - https://phabricator.wikimedia.org/T91077#1525420 (10RobH) [20:45:11] (03CR) 10Dzahn: "i did planet and grafana already in separate small patches" [puppet] - 10https://gerrit.wikimedia.org/r/225552 (owner: 10Gergő Tisza) [20:46:49] (03PS1) 10Dzahn: grafana: prepare for Apache 2.4 syntax [puppet] - 10https://gerrit.wikimedia.org/r/230650 (https://phabricator.wikimedia.org/T105008) [20:47:33] 6operations, 6Release-Engineering: [Spike] Try out hack (>! In T91590#1525171, @hashar wrote: > Maybe we should just close this task until we open up the discussion to migrate to Hack? Meanwhile, that is yet another opened... [20:51:14] (03CR) 10Dzahn: "ottomata: +1 now after your comments on PS1?" [puppet] - 10https://gerrit.wikimedia.org/r/229716 (owner: 10Muehlenhoff) [20:51:18] (03CR) 10Tim Landscheidt: "Thanks for looking that up. 
I think in that case I would keep gid_range 65000-65500 as that is what is currently configured in the live T" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/230477 (https://phabricator.wikimedia.org/T95747) (owner: 10Tim Landscheidt) [20:57:04] !log ori Synchronized php-1.26wmf17/includes/cache/MessageCache.php: I2089b21fc: MessageCache: use APC for local caching, rather than files (duration: 00m 12s) [20:57:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:57:32] (03PS1) 10Andrew Bogott: replace $::instanceproject with $::labsproject [puppet] - 10https://gerrit.wikimedia.org/r/230652 (https://phabricator.wikimedia.org/T93684) [20:59:00] (03PS2) 10coren: Change test for log_type to a list [software] - 10https://gerrit.wikimedia.org/r/230645 [21:02:18] PROBLEM - puppet last run on iridium is CRITICAL Puppet has 1 failures [21:09:48] PROBLEM - Varnishkafka log producer on cp3047 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishkafka [21:12:38] RECOVERY - puppet last run on iridium is OK Puppet is currently enabled, last run 2 minutes ago with 0 failures [21:15:58] RECOVERY - Varnishkafka log producer on cp3047 is OK: PROCS OK: 1 process with command name varnishkafka [21:18:07] 6operations, 10Gitblit-Deprecate, 10Wikimedia-Git-or-Gerrit: git.wikimedia.org is unstable - https://phabricator.wikimedia.org/T83702#1525569 (10Malyacko) [21:21:56] (03CR) 10Dzahn: "@rush thoughts on the in-line comment re: diamond?" 
[puppet] - 10https://gerrit.wikimedia.org/r/223844 (https://phabricator.wikimedia.org/T104970) (owner: 10Dzahn) [21:25:03] (03CR) 10Rush: ferm rules for nutcracker (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/223844 (https://phabricator.wikimedia.org/T104970) (owner: 10Dzahn) [21:26:21] 6operations, 10Citoid, 10Security-Reviews, 6Security-Team: http://citoid.wikimedia.org/ should force HTTPS - https://phabricator.wikimedia.org/T108632#1525587 (10Jdforrester-WMF) 3NEW [21:27:24] 6operations, 10Citoid, 10Security-Reviews, 6Security-Team, 7HTTPS: http://citoid.wikimedia.org/ should force HTTPS - https://phabricator.wikimedia.org/T108632#1525597 (10Krenair) [21:27:56] 6operations, 10Citoid, 6Security, 6Security-Team, and 2 others: http://citoid.wikimedia.org/ should force HTTPS - https://phabricator.wikimedia.org/T108632#1525587 (10Krenair) [21:28:42] (03PS2) 10Dzahn: grafana: add role to krypton (VM) [puppet] - 10https://gerrit.wikimedia.org/r/229737 [21:33:46] (03PS3) 10Dzahn: grafana: add role to krypton (VM) [puppet] - 10https://gerrit.wikimedia.org/r/229737 [21:34:45] (03CR) 10Dzahn: [C: 032] grafana: add role to krypton (VM) [puppet] - 10https://gerrit.wikimedia.org/r/229737 (owner: 10Dzahn) [21:37:15] 6operations, 5Patch-For-Review: move grafana from zirconium to a VM - https://phabricator.wikimedia.org/T105008#1525649 (10Dzahn) a:3Dzahn [21:37:55] (03PS1) 10Dzahn: misc-web: switch grafana to backend krypton [puppet] - 10https://gerrit.wikimedia.org/r/230660 (https://phabricator.wikimedia.org/T105008) [21:39:33] (03PS2) 10Dzahn: grafana: prepare for Apache 2.4 syntax [puppet] - 10https://gerrit.wikimedia.org/r/230650 (https://phabricator.wikimedia.org/T105008) [21:40:12] (03CR) 10Dzahn: [C: 032] grafana: prepare for Apache 2.4 syntax [puppet] - 10https://gerrit.wikimedia.org/r/230650 (https://phabricator.wikimedia.org/T105008) (owner: 10Dzahn) [21:40:44] (03Abandoned) 10BBlack: cache::config: remove unused swift backend def [puppet] - 
10https://gerrit.wikimedia.org/r/230540 (owner: 10BBlack) [21:41:58] PROBLEM - HTTP on krypton is CRITICAL: Connection refused [21:42:18] ^ me, on it [21:43:22] (03PS1) 10Rush: phab: reduce log history [puppet] - 10https://gerrit.wikimedia.org/r/230661 [21:43:35] (03PS2) 10BBlack: cache::config: replace lvs IP refs with service hostnames [puppet] - 10https://gerrit.wikimedia.org/r/230541 (https://phabricator.wikimedia.org/T108580) [21:46:11] (03PS1) 10Ori.livneh: Build wikiversions.php in addition to wikiversions.cdb [mediawiki-config] - 10https://gerrit.wikimedia.org/r/230662 (https://phabricator.wikimedia.org/T108638) [21:48:06] (03CR) 10Ori.livneh: [C: 032] Build wikiversions.php in addition to wikiversions.cdb [mediawiki-config] - 10https://gerrit.wikimedia.org/r/230662 (https://phabricator.wikimedia.org/T108638) (owner: 10Ori.livneh) [21:48:11] (03Merged) 10jenkins-bot: Build wikiversions.php in addition to wikiversions.cdb [mediawiki-config] - 10https://gerrit.wikimedia.org/r/230662 (https://phabricator.wikimedia.org/T108638) (owner: 10Ori.livneh) [21:48:27] (03PS1) 10Dzahn: Revert "grafana: add role to krypton (VM)" [puppet] - 10https://gerrit.wikimedia.org/r/230664 [21:49:45] (03CR) 10BBlack: [C: 032] cache::config: replace lvs IP refs with service hostnames [puppet] - 10https://gerrit.wikimedia.org/r/230541 (https://phabricator.wikimedia.org/T108580) (owner: 10BBlack) [21:52:28] RECOVERY - HTTP on krypton is OK: HTTP OK: HTTP/1.1 200 OK - 1485 bytes in 0.006 second response time [21:53:04] !log krypton unloaded mod proxy_balancer [21:53:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:55:07] (03PS2) 10Dzahn: Revert "grafana: add role to krypton (VM)" [puppet] - 10https://gerrit.wikimedia.org/r/230664 [21:56:08] (03CR) 10Dzahn: "currently breaks Apache 2.4" [puppet] - 10https://gerrit.wikimedia.org/r/229737 (owner: 10Dzahn) [21:56:48] PROBLEM - carbon-cache too many creates on graphite1001 is CRITICAL 1.67% of data 
above the critical threshold [1000.0] [21:56:57] PROBLEM - puppet last run on cp2017 is CRITICAL Puppet has 1 failures [21:58:47] PROBLEM - puppet last run on cp1055 is CRITICAL Puppet has 1 failures [22:00:38] PROBLEM - puppet last run on cp2013 is CRITICAL Puppet has 1 failures [22:01:00] PROBLEM - puppet last run on cp2002 is CRITICAL Puppet has 1 failures [22:01:00] PROBLEM - puppet last run on cp2001 is CRITICAL Puppet has 1 failures [22:02:00] PROBLEM - puppet last run on cp1053 is CRITICAL Puppet has 1 failures [22:02:21] PROBLEM - puppet last run on cp1068 is CRITICAL Puppet has 1 failures [22:03:20] PROBLEM - puppet last run on cp1064 is CRITICAL Puppet has 1 failures [22:03:52] PROBLEM - puppet last run on cp1048 is CRITICAL Puppet has 1 failures [22:04:55] blarg [22:05:00] PROBLEM - puppet last run on cp2004 is CRITICAL Puppet has 1 failures [22:05:14] 6operations, 10ops-codfw: EQDFW/EQORD Deployment Prep Task - https://phabricator.wikimedia.org/T91077#1525797 (10RobH) [22:05:41] 6operations, 5Patch-For-Review: move grafana from zirconium to a VM - https://phabricator.wikimedia.org/T105008#1525802 (10Dzahn) simply adding the role to krypton (jessie, Apache 2.4 as opposed to zirconium) currently fails in relation to grafana trying to use Apache mod proxy_balancer revert for now: https... [22:05:41] PROBLEM - puppet last run on cp1059 is CRITICAL Puppet has 1 failures [22:05:53] 6operations, 10ops-codfw: EQDFW/EQORD Deployment Prep Task - https://phabricator.wikimedia.org/T91077#1073784 (10RobH) I've emailed our EQ rep Arul to determine our cross-connection info for eqdfw. I have all the IX info for both sites: CH2 Peering: A-Side System Name CH2::01055:WIKIMEDIA FOUNDATION INC Cabl... 
[22:06:01] PROBLEM - puppet last run on cp2024 is CRITICAL Puppet has 1 failures [22:06:42] PROBLEM - puppet last run on cp1060 is CRITICAL Puppet has 1 failures [22:06:42] PROBLEM - puppet last run on cp2009 is CRITICAL Puppet has 1 failures [22:06:55] (03PS1) 10BBlack: Revert "cache::config: replace lvs IP refs with service hostnames" [puppet] - 10https://gerrit.wikimedia.org/r/230677 [22:07:02] (03CR) 10BBlack: [C: 032] Revert "cache::config: replace lvs IP refs with service hostnames" [puppet] - 10https://gerrit.wikimedia.org/r/230677 (owner: 10BBlack) [22:07:10] (03CR) 10BBlack: [V: 032] Revert "cache::config: replace lvs IP refs with service hostnames" [puppet] - 10https://gerrit.wikimedia.org/r/230677 (owner: 10BBlack) [22:07:28] 6operations, 10ops-codfw: EQDFW/EQORD Deployment Prep Task - https://phabricator.wikimedia.org/T91077#1525824 (10RobH) [22:08:01] PROBLEM - puppet last run on cp1047 is CRITICAL Puppet has 1 failures [22:08:01] PROBLEM - puppet last run on cp1052 is CRITICAL Puppet has 1 failures [22:09:41] PROBLEM - puppet last run on cp1061 is CRITICAL Puppet has 1 failures [22:10:01] PROBLEM - puppet last run on cp2005 is CRITICAL Puppet has 1 failures [22:10:31] PROBLEM - puppet last run on cp2008 is CRITICAL Puppet has 1 failures [22:11:10] PROBLEM - puppet last run on cp1067 is CRITICAL Puppet has 1 failures [22:11:20] (03PS1) 10Ori.livneh: Build wikiversions.php in addition to wikiversions.cdb [tools/scap] - 10https://gerrit.wikimedia.org/r/230679 (https://phabricator.wikimedia.org/T108638) [22:11:25] ^ bd808 [22:12:14] do we have the disk space for this on snapshot1001? 
[22:12:31] PROBLEM - puppet last run on cp1046 is CRITICAL Puppet has 1 failures [22:12:54] (03CR) 10jenkins-bot: [V: 04-1] Build wikiversions.php in addition to wikiversions.cdb [tools/scap] - 10https://gerrit.wikimedia.org/r/230679 (https://phabricator.wikimedia.org/T108638) (owner: 10Ori.livneh) [22:14:01] PROBLEM - puppet last run on cp1050 is CRITICAL Puppet has 1 failures [22:14:05] nm. I'm dumb and thought it was something completely different [22:16:09] (03PS2) 10Ori.livneh: Build wikiversions.php in addition to wikiversions.cdb [tools/scap] - 10https://gerrit.wikimedia.org/r/230679 (https://phabricator.wikimedia.org/T108638) [22:18:30] Where does the bugzilla -> phab redirect code actually live? [22:20:07] akosiaris, back. Do you know if slaves are still rebuilding? [22:20:16] 6operations, 6Release-Engineering: [Spike] Try out hack (3declined a:3demon I don't have a problem with open unassigned bugs, but I guess others do. [22:20:45] Krenair: maybe https://github.com/wikimedia/operations-puppet/blob/acacf97e2df962fef83487a461f3559fa07e4d6f/modules/phabricator/templates/redirect_config.json.erb ? 
[22:22:02] RECOVERY - puppet last run on cp2017 is OK Puppet is currently enabled, last run 43 seconds ago with 0 failures [22:22:29] !log restart gitblit [22:22:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:22:47] (03PS1) 10Ori.livneh: Grafana: Apache 2.3+ compatibility [puppet] - 10https://gerrit.wikimedia.org/r/230682 [22:22:54] Krenair: see also https://github.com/wikimedia/operations-puppet/blob/production/modules/phabricator/manifests/redirector.pp and https://github.com/wikimedia/operations-puppet/blob/production/modules/phabricator/templates/preamble.php.erb [22:23:24] Krenair: i think wmf/puppet/modules/mediawiki/files/apache/sites/redirects/ [22:23:31] thanks [22:23:42] RECOVERY - puppet last run on cp1055 is OK Puppet is currently enabled, last run 47 seconds ago with 0 failures [22:23:47] MatmaRex noticed in #mediawiki that bugs.mediawiki.org redirects via bugzilla then to phab [22:25:20] RECOVERY - puppet last run on cp1068 is OK Puppet is currently enabled, last run 8 seconds ago with 0 failures [22:25:31] PROBLEM - grafana.wikimedia.org on krypton is CRITICAL: Connection refused [22:25:41] PROBLEM - HTTP on krypton is CRITICAL: Connection refused [22:25:51] RECOVERY - puppet last run on cp2002 is OK Puppet is currently enabled, last run 47 seconds ago with 0 failures [22:26:41] RECOVERY - puppet last run on cp2001 is OK Puppet is currently enabled, last run 25 seconds ago with 0 failures [22:26:51] RECOVERY - puppet last run on cp1053 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [22:26:51] can someone help with gitblit downloads being broken .. it is causing our jenkins jobs to fail .. 
ex: https://integration.wikimedia.org/ci/job/parsoidsvc-php-parsertests/5332/console [22:27:01] RECOVERY - puppet last run on cp2013 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [22:28:05] (03PS2) 10Dzahn: Grafana: Apache 2.3+ compatibility [puppet] - 10https://gerrit.wikimedia.org/r/230682 (owner: 10Ori.livneh) [22:28:21] RECOVERY - puppet last run on cp1064 is OK Puppet is currently enabled, last run 51 seconds ago with 0 failures [22:29:31] subbu: can you reconstruct which URL it is trying to fetch? [22:29:39] 6operations, 10Traffic, 10Wikimedia-DNS, 5Patch-For-Review: DNS request for wikimedia.org (let 3rd party send mail as wikimedia.org) - https://phabricator.wikimedia.org/T107940#1525888 (10CCogdill_WMF) @CaitVirtue do you want any other aliases for this domain? Brandon set up anna@benefactors.wikimedia.org... [22:29:47] AFAICT it is not echoed out by the test environment build script [22:30:16] (03CR) 10Dzahn: [C: 032] "thank you Ori, this should fix the reason for Ied03598b3d8909" [puppet] - 10https://gerrit.wikimedia.org/r/230682 (owner: 10Ori.livneh) [22:30:57] ori, one sec. let me look there. [22:31:10] RECOVERY - git.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 61474 bytes in 0.189 second response time [22:31:21] RECOVERY - puppet last run on cp1048 is OK Puppet is currently enabled, last run 2 minutes ago with 0 failures [22:32:01] RECOVERY - puppet last run on cp1060 is OK Puppet is currently enabled, last run 27 seconds ago with 0 failures [22:32:11] RECOVERY - puppet last run on cp2009 is OK Puppet is currently enabled, last run 17 seconds ago with 0 failures [22:32:14] subbu: icinga-wm just emitted a recovery notice for git.wikimedia.org -- maybe a rebuild would work now? [22:32:26] ori, i am sure it is m/w core .. so that it can run .. ok, let me recheck. 
[22:32:31] RECOVERY - puppet last run on cp2004 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [22:33:03] subbu: gitblit restarted and monitoring says it's back [22:33:11] RECOVERY - puppet last run on cp1059 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [22:33:25] thanks. issued a recheck on the patch .. and we'll know soon. [22:33:30] RECOVERY - puppet last run on cp1047 is OK Puppet is currently enabled, last run 33 seconds ago with 0 failures [22:33:30] RECOVERY - puppet last run on cp1052 is OK Puppet is currently enabled, last run 23 seconds ago with 0 failures [22:33:31] RECOVERY - puppet last run on cp2024 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [22:34:31] RECOVERY - puppet last run on cp1067 is OK Puppet is currently enabled, last run 7 seconds ago with 0 failures [22:34:55] 6operations, 10Gitblit-Deprecate, 10Wikimedia-Git-or-Gerrit: git.wikimedia.org is unstable - https://phabricator.wikimedia.org/T83702#1525920 (10Dzahn) also see T104356 [22:35:11] RECOVERY - puppet last run on cp1061 is OK Puppet is currently enabled, last run 25 seconds ago with 0 failures [22:35:20] 6operations, 10Gitblit-Deprecate, 10Wikimedia-Git-or-Gerrit: git.wikimedia.org is unstable - https://phabricator.wikimedia.org/T83702#1525922 (10Dzahn) [22:35:30] RECOVERY - puppet last run on cp2005 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [22:36:01] RECOVERY - puppet last run on cp2008 is OK Puppet is currently enabled, last run 19 seconds ago with 0 failures [22:36:11] 6operations, 10Traffic, 10fundraising-tech-ops, 5Patch-For-Review: Decide what to do with *.donate.wikimedia.org subdomain + TLS - https://phabricator.wikimedia.org/T102827#1525929 (10CCogdill_WMF) IBM tells us it's not possible to have a customized domain with an active ssl cert in place; they aren't able... 
[22:36:30] (03PS1) 10Ottomata: Change varnishkafka webrequest required.acks to 1 [puppet] - 10https://gerrit.wikimedia.org/r/230685 [22:37:27] (03CR) 10Dzahn: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/230682 (owner: 10Ori.livneh) [22:38:27] (03CR) 10Ottomata: [C: 032] Change varnishkafka webrequest required.acks to 1 [puppet] - 10https://gerrit.wikimedia.org/r/230685 (owner: 10Ottomata) [22:38:44] (03CR) 10Ottomata: [V: 032] Change varnishkafka webrequest required.acks to 1 [puppet] - 10https://gerrit.wikimedia.org/r/230685 (owner: 10Ottomata) [22:39:31] RECOVERY - puppet last run on cp1050 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [22:40:01] RECOVERY - puppet last run on cp1046 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [22:41:12] (03PS3) 10Ori.livneh: Grafana: Apache 2.3+ compatibility [puppet] - 10https://gerrit.wikimedia.org/r/230682 [22:41:29] mutante: yeah, I think jenkins is overwhelmed [22:42:09] ori: yep, uploading another one [22:42:45] (03PS1) 10Dzahn: etherpad: make compatible with Apache 2.4 [puppet] - 10https://gerrit.wikimedia.org/r/230686 [22:43:19] (03CR) 10Dzahn: [C: 04-1] "not ready to switch just yet" [puppet] - 10https://gerrit.wikimedia.org/r/230660 (https://phabricator.wikimedia.org/T105008) (owner: 10Dzahn) [22:44:14] (03PS1) 10BBlack: rename varnish backends more-explicitly [puppet] - 10https://gerrit.wikimedia.org/r/230687 [22:44:16] (03PS2) 10Dzahn: etherpad: make compatible with Apache 2.4 [puppet] - 10https://gerrit.wikimedia.org/r/230686 [22:46:23] subbu: better? [22:46:56] (03CR) 10BryanDavis: [C: 04-1] "Putting the -1 on here that godog should have added for "wow will this crush statsd?"." [puppet] - 10https://gerrit.wikimedia.org/r/230233 (https://phabricator.wikimedia.org/T100735) (owner: 10BryanDavis) [22:46:56] the jenkins jobs haven't completed .. but checked zuul and yes, the php parser tests seems to have run to completion and is green. 
[22:47:35] (03PS3) 10Dzahn: etherpad: make compatible with Apache 2.4 [puppet] - 10https://gerrit.wikimedia.org/r/230686 [22:47:43] so, at least that part is better. but, jenkins / zuul seems like it is not making a lot of progress. [22:48:30] PROBLEM - puppet last run on cp3019 is CRITICAL puppet fail [22:48:32] (03PS3) 10Ori.livneh: Build wikiversions.php in addition to wikiversions.cdb [tools/scap] - 10https://gerrit.wikimedia.org/r/230679 (https://phabricator.wikimedia.org/T108638) [22:48:36] subbu: :) [22:48:42] well, re: the first the part being better. [22:48:59] :) [22:49:58] (03PS1) 10Ori.livneh: Add wikiversions{-labs}.php to .gitignore [mediawiki-config] - 10https://gerrit.wikimedia.org/r/230690 (https://phabricator.wikimedia.org/T108638) [22:50:06] bblack, should https://gerrit.wikimedia.org/r/#/c/230535/ be merged? [22:50:15] (03CR) 10Ori.livneh: [C: 032] Add wikiversions{-labs}.php to .gitignore [mediawiki-config] - 10https://gerrit.wikimedia.org/r/230690 (https://phabricator.wikimedia.org/T108638) (owner: 10Ori.livneh) [22:50:21] (03Merged) 10jenkins-bot: Add wikiversions{-labs}.php to .gitignore [mediawiki-config] - 10https://gerrit.wikimedia.org/r/230690 (https://phabricator.wikimedia.org/T108638) (owner: 10Ori.livneh) [22:52:11] RECOVERY - carbon-cache too many creates on graphite1001 is OK Less than 1.00% above the threshold [500.0] [22:53:15] yurik: ask otto, I have no idea what's involved in the merging and deploying of a change to analytics/refinery [22:53:26] (03PS1) 10Dzahn: kibana: make compatible with Apache 2.4 [puppet] - 10https://gerrit.wikimedia.org/r/230692 [22:54:06] yurik: sorry, i hopefully can help more tomorrow [22:54:16] in the middle of a migration that is UMm only kinda ok. 
[22:54:36] ottomata, no worries, just needed the status for our meeting tomorrow :) [22:54:44] RECOVERY - puppet last run on cp3019 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [22:56:15] (03PS1) 10Dzahn: dbtree: make compatible with Apache 2.4 [puppet] - 10https://gerrit.wikimedia.org/r/230693 [22:56:59] bblack, yurik, i just logged in to look at one of the maps varnishes, i see varnishkafka there is logging ot webrequest_misc? [22:57:38] ottomata: it's probably a leftover from before the role switch. I wonder if it actually grabs traffic still? [22:57:41] akosiaris ^? [22:58:16] bblack i've lost your change [22:58:21] for the webrequest_maps thing [22:58:49] ah found it [22:58:59] ah naw [22:59:01] this can happen first bblack [22:59:01] https://gerrit.wikimedia.org/r/#/c/230539/ [22:59:03] that's fine [22:59:09] mind if I merge that? [22:59:15] ok [22:59:15] (03PS2) 10BBlack: Add webrequest_maps kafka topic output for cache_maps [puppet] - 10https://gerrit.wikimedia.org/r/230539 (https://phabricator.wikimedia.org/T105076) [22:59:18] I just rebased [22:59:31] k merging [22:59:43] (03CR) 10Ottomata: [C: 032] "Naw, this is ok. Camus is not a blocker." [puppet] - 10https://gerrit.wikimedia.org/r/230539 (https://phabricator.wikimedia.org/T105076) (owner: 10BBlack) [23:00:04] RoanKattouw ostriches Krenair: Dear anthropoid, the time has come. Please deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150810T2300). [23:00:04] James_F ebernhardson: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [23:00:23] (03CR) 10BryanDavis: [C: 032] "Tested via cherry-pick in beta cluster. wikiversions-labs.php was generated and synced to all hosts. 
The file was both logically and synta" [tools/scap] - 10https://gerrit.wikimedia.org/r/230679 (https://phabricator.wikimedia.org/T108638) (owner: 10Ori.livneh) [23:00:33] (03PS2) 10Dzahn: dbtree: make compatible with Apache 2.4 [puppet] - 10https://gerrit.wikimedia.org/r/230693 [23:00:34] ok [23:00:50] (03CR) 10Ottomata: [V: 032] Add webrequest_maps kafka topic output for cache_maps [puppet] - 10https://gerrit.wikimedia.org/r/230539 (https://phabricator.wikimedia.org/T105076) (owner: 10BBlack) [23:00:52] (03PS2) 10Dzahn: kibana: make compatible with Apache 2.4 [puppet] - 10https://gerrit.wikimedia.org/r/230692 [23:00:56] I'll swat then [23:01:09] * ebernhardson waves [23:01:10] bd808: \o/ thanks [23:01:15] * James_F waves. [23:01:32] (03PS1) 10Dzahn: openstack: make dashboard compatible with Apache 2.4 [puppet] - 10https://gerrit.wikimedia.org/r/230694 [23:02:02] (03PS4) 10Alex Monk: VisualEditor: Switch from …Namespaces to …AvailableNamespaces [mediawiki-config] - 10https://gerrit.wikimedia.org/r/228040 (owner: 10Jforrester) [23:02:15] yurik: ok, not in hadoop yet [23:02:17] (03CR) 10Alex Monk: [C: 032] VisualEditor: Switch from …Namespaces to …AvailableNamespaces [mediawiki-config] - 10https://gerrit.wikimedia.org/r/228040 (owner: 10Jforrester) [23:02:18] but you can on stat1002 do [23:02:19] kafkacat -C -b analytics1012.eqiad.wmnet:9092 -t webrequest_maps [23:02:24] (03Merged) 10jenkins-bot: VisualEditor: Switch from …Namespaces to …AvailableNamespaces [mediawiki-config] - 10https://gerrit.wikimedia.org/r/228040 (owner: 10Jforrester) [23:02:27] or use kafkatee in a similar way with a config file and grab it [23:02:39] (03CR) 10Ori.livneh: [V: 032] Build wikiversions.php in addition to wikiversions.cdb [tools/scap] - 10https://gerrit.wikimedia.org/r/230679 (https://phabricator.wikimedia.org/T108638) (owner: 10Ori.livneh) [23:02:42] (03PS5) 10Alex Monk: Enable VisualEditor on NS_PROJECT for meta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/228041 
(https://phabricator.wikimedia.org/T107003) (owner: 10Jforrester) [23:02:49] (03CR) 10Alex Monk: [C: 032] Enable VisualEditor on NS_PROJECT for meta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/228041 (https://phabricator.wikimedia.org/T107003) (owner: 10Jforrester) [23:02:55] (03Merged) 10jenkins-bot: Enable VisualEditor on NS_PROJECT for meta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/228041 (https://phabricator.wikimedia.org/T107003) (owner: 10Jforrester) [23:03:00] somebody loves "uri_path":"/_info" ! :) [23:03:28] we're using that to monitor service health [23:03:41] standard per gwicke :D [23:04:50] !log deployed scap a404a39b32... Build wikiversions.php in addition to wikiversions.cdb [23:04:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:07:37] yurik: per mobrovac, actually ;) [23:07:57] !log krenair@tin Synchronized wmf-config: https://gerrit.wikimedia.org/r/#/c/228040/ (duration: 00m 14s) [23:08:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:08:22] Krenair: Notice: Undefined variable: wmgVisualEditorAvailableNamespaces in /srv/mediawiki/wmf-config/CommonSettings.php on line 2006 [23:08:28] I can see that, thanks ebernhardson [23:09:34] Oh dear. [23:09:35] ebernhardson, I imagine that's in cases where it synchronised CommonSettings before InitialiseSettings [23:09:54] Krenair: It seems to be active on all namespaces until the JS kicks in. [23:10:07] Krenair: E.g. load https://www.mediawiki.org/wiki/Talk:VisualEditor and watch the tabs. [23:10:17] Hm [23:10:25] Is PHP assuming that unset keys go to true rather than false? [23:11:20] wgVisualEditorAvailableNamespaces does not have 1 => true [23:11:26] But the code in VE? [23:11:43] vgVisualEditorAvailableNamespaces doesn't have 1 => false either, right? 
[23:11:43] !log krenair@tin Synchronized wmf-config/InitialiseSettings.php: (no message) (duration: 00m 12s) [23:11:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:12:00] Reverted? [23:12:02] no [23:12:24] Oh. [23:12:25] Wait. [23:12:26] What? [23:12:35] foreach ( $wgContentNamespaces as $contentNamespace ) { [23:12:35] if ( !isset( $wgVisualEditorAvailableNamespaces[$contentNamespace] ) ) { [23:12:36] $wgVisualEditorAvailableNamespaces[$contentNamespace] = true; [23:12:37] } [23:12:41] Surely, false? [23:13:01] that should turn it on in all content namespaces by default [23:13:20] Without over-rides. [23:13:23] Eh. [23:13:44] Is onRegistration always going to run after all namespaces are defined? [23:15:20] $wgContentNamespaces? yes, it should [23:15:29] unless someone is doing a weird hack [23:16:27] There's no code in onSkinTemplateNavigation to disable VE's edit tab per NS, except for Education Program and MediaWiki:, and configuring it differently for File: when the page is remote. [23:16:42] Something to fix. [23:16:46] James_F, so we don't actually check the namespace against AvailableNamespaces before setting up the tab from the php hooks [23:16:57] Yeah. [23:16:59] Something to add. [23:17:01] maybe another async rl thing? [23:17:16] Ideally before the train so it's less of an issue. :-) [23:17:17] if we were relying on our JS to kick in and hide that tab before firstpaint before... [23:17:33] Krenair: It was an issue before this change, then. 
[23:19:07] James_F, I reverted the changes on tin and synced them to mw1017 [23:19:09] (03PS1) 10Faidon Liambotis: Addd forgotten A record for mr1-ulsfo.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/230695 [23:19:11] (03PS1) 10Faidon Liambotis: Add A/PTR for mr1-codfw and msw1-codfw [dns] - 10https://gerrit.wikimedia.org/r/230696 [23:19:22] turned on the wikimedia-debug extension in chrome and got the page back from mw1017 [23:19:38] it still shows the edit tab flashing and then disappearing on a mw.org talk page [23:19:40] Krenair: Doesn't fix it. https://test.wikipedia.org/wiki/Talk:Fooo [23:19:42] Yeah. [23:19:52] So let's go with this and fix it now. [23:20:02] (03CR) 10Faidon Liambotis: [C: 032] Addd forgotten A record for mr1-ulsfo.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/230695 (owner: 10Faidon Liambotis) [23:20:22] (03PS2) 10Faidon Liambotis: Add forgotten A record for mr1-ulsfo.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/230695 [23:20:24] (03PS2) 10Faidon Liambotis: Add A/PTR for mr1-codfw and msw1-codfw [dns] - 10https://gerrit.wikimedia.org/r/230696 [23:20:47] James_F, so this change seems fine, shall we do TMH + ebernhardson's search commit before fixing the existing ve issue? [23:21:01] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: Access to stat1002 for csteipp - https://phabricator.wikimedia.org/T108227#1526084 (10csteipp) >>! In T108227#1523852, @Ottomata wrote: > If @csteipp wants access to webrequest logs in Hive, he will need to be in the analytics-privatedata-users group. Y... [23:21:11] jenkins gone? [23:22:12] Krenair: Go for it. [23:22:17] Krenair: Writing something now. [23:22:20] ok [23:23:17] anyone from releng around? 
[23:27:51] 6operations, 10Citoid, 6Security, 6Security-Team, and 2 others: http://citoid.wikimedia.org/ should force HTTPS - https://phabricator.wikimedia.org/T108632#1526118 (10mobrovac) Hopefully we will be able to turn off `parsoidcache` soon(TM), which effectively means `citoid.wikimedia.org` would disappear. Ins... [23:28:40] (03PS1) 10Ottomata: Send 0.8.2.1 Kafka metrics via jmxtrans to graphite/ganglia [puppet] - 10https://gerrit.wikimedia.org/r/230699 [23:28:58] (03PS1) 10Tim Starling: Deploy ParsoidBatchAPI on labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/230700 [23:29:19] greg-g: jenkins jobs for DNS are not running; not sure who from releng is supposed to handle it while hashar is away [23:30:21] (03PS2) 10Ottomata: Send 0.8.2.1 Kafka metrics via jmxtrans to graphite/ganglia [puppet] - 10https://gerrit.wikimedia.org/r/230699 [23:30:32] 6operations, 10Citoid, 6Security, 6Security-Team, and 2 others: http://citoid.wikimedia.org/ should force HTTPS - https://phabricator.wikimedia.org/T108632#1526127 (10BBlack) @mobrovac: the code used to redirect the main sites can be enabled for parsoidcache with the flick of a virtual switch. We just did... [23:31:22] 6operations, 10Citoid, 6Security, 6Security-Team, and 2 others: http://citoid.wikimedia.org/ should force HTTPS - https://phabricator.wikimedia.org/T108632#1526130 (10Jdforrester-WMF) >>! In T108632#1526118, @mobrovac wrote: > Hopefully we will be able to turn off `parsoidcache` soon(TM), which effectively... 
[23:32:43] (03PS1) 10Ori.livneh: Convert multiversion scripts to use wikiversions.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/230702 (https://phabricator.wikimedia.org/T108638) [23:33:31] (03CR) 10Ori.livneh: [C: 032] Convert multiversion scripts to use wikiversions.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/230702 (https://phabricator.wikimedia.org/T108638) (owner: 10Ori.livneh) [23:33:37] (03Merged) 10jenkins-bot: Convert multiversion scripts to use wikiversions.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/230702 (https://phabricator.wikimedia.org/T108638) (owner: 10Ori.livneh) [23:33:41] paravoid: which jobs? [23:33:55] operations/dns ones [23:33:57] e.g. https://gerrit.wikimedia.org/r/#/c/230695/ [23:34:02] there are many queued jobs at the moment [23:34:16] these are usually super quick, they run on their own slave afaik [23:34:48] Wtf? [23:34:58] https://integration.wikimedia.org/zuul/ zuul is backed up [23:35:10] (03CR) 10Ottomata: [C: 032 V: 032] Send 0.8.2.1 Kafka metrics via jmxtrans to graphite/ganglia [puppet] - 10https://gerrit.wikimedia.org/r/230699 (owner: 10Ottomata) [23:35:17] Who is Adam Wight on IRC? [23:35:29] Krenair: awight [23:36:17] in -dev and no other channels.. [23:36:28] 6operations, 10Citoid, 6Security, 6Security-Team, and 2 others: http://citoid.wikimedia.org/ should force HTTPS - https://phabricator.wikimedia.org/T108632#1526149 (10BBlack) So long as the mapping is trivial (e.g. `https://citoid.wikimedia.org/foo/bar` becomes `https://en.wikipedia.org/api/v1/citoid/foo/b... 
[23:36:29] Krenair: no other channels that you are in [23:36:45] awight: Krenair try #wikimedia-fundraising [23:36:52] (03PS1) 10Ori.livneh: Fix-up for I511999137b [mediawiki-config] - 10https://gerrit.wikimedia.org/r/230703 [23:37:02] (03CR) 10Ori.livneh: [C: 032] Fix-up for I511999137b [mediawiki-config] - 10https://gerrit.wikimedia.org/r/230703 (owner: 10Ori.livneh) [23:37:08] (03Merged) 10jenkins-bot: Fix-up for I511999137b [mediawiki-config] - 10https://gerrit.wikimedia.org/r/230703 (owner: 10Ori.livneh) [23:37:42] (03PS2) 10Tim Starling: Deploy ParsoidBatchAPI on labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/230700 [23:37:44] 6operations, 6Analytics-Backlog, 10Analytics-EventLogging, 10Traffic: EventLogging query strings are truncated to 1014 bytes by ?(varnishncsa? or udp packet size?) - https://phabricator.wikimedia.org/T91347#1526151 (10awight) It seems like wasted effort to increase the length limit beyond 2,000 chars. See... [23:37:51] 6operations, 10Citoid, 6Security, 6Security-Team, and 2 others: http://citoid.wikimedia.org/ should force HTTPS - https://phabricator.wikimedia.org/T108632#1526152 (10Jdforrester-WMF) >>! In T108632#1526149, @BBlack wrote: > So long as the mapping is trivial (e.g. `https://citoid.wikimedia.org/foo/bar` bec... [23:37:58] He made https://gerrit.wikimedia.org/r/#/c/230657/ and hasn't actually deployed it [23:38:08] (03CR) 10Legoktm: [C: 031] Deploy ParsoidBatchAPI on labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/230700 (owner: 10Tim Starling) [23:38:21] paravoid: they appear to be queued as well. afaict, gearman isn't stuck. we just have a ton of backed up jobs atm [23:38:27] alright, tahnks [23:38:29] Krenair: if its about DonationInterface it basically doesn't matter. 
that doesn't run in prod but having the package deployed does something (i forget what) [23:38:57] It is about DonationInterface [23:39:09] Krenair: its always safe to just update DonationInterface [23:39:13] Krenair: it's on a dev branch isn't it? [23:40:31] 6operations, 10ops-codfw: EQDFW/EQORD Deployment Prep Task - https://phabricator.wikimedia.org/T91077#1526158 (10RobH) [23:40:45] !log ori@tin Synchronized multiversion/MWMultiVersion.php: I511999: Convert multiversion scripts to use wikiversions.php (duration: 00m 12s) [23:40:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:41:46] ebernhardson, so we don't need to actually sync it? [23:42:16] bd808, no? [23:42:21] it's on the deployment branch [23:42:31] presumably the E:DonationInterface equivalent of wmf/* [23:42:39] Krenair: right [23:44:03] (03PS2) 10BBlack: rename varnish backends more-explicitly [puppet] - 10https://gerrit.wikimedia.org/r/230687 [23:44:05] (03CR) 10Tim Starling: [C: 032] Deploy ParsoidBatchAPI on labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/230700 (owner: 10Tim Starling) [23:44:31] (03Merged) 10jenkins-bot: Deploy ParsoidBatchAPI on labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/230700 (owner: 10Tim Starling) [23:44:31] !log krenair@tin Synchronized php-1.26wmf17/extensions/TimedMediaHandler/TimedMediaIframeOutput.php: https://gerrit.wikimedia.org/r/#/c/230656/ (duration: 00m 12s) [23:44:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:47:02] 6operations, 10Deployment-Systems, 10RESTBase, 6Services, 5Patch-For-Review: [Discussion] Move restbase config to Ansible (or $deploy_system in general)? - https://phabricator.wikimedia.org/T107532#1526208 (10mobrovac) >>! In T107532#1525074, @GWicke wrote: > How can we move this forward? It is perhaps... 
[23:47:15] PROBLEM - carbon-cache too many creates on graphite1001 is CRITICAL 3.39% of data above the critical threshold [1000.0] [23:47:30] James_F, thedj: I don't think the TMH patch has fixed it fully [23:47:39] Boo. [23:47:41] (03CR) 10BBlack: [C: 04-1] "Hmmm, this would also need a compatible change to the directors template for confd" [puppet] - 10https://gerrit.wikimedia.org/r/230687 (owner: 10BBlack) [23:47:51] Oh, hang on [23:48:09] better with debug=true [23:48:37] Ha. [23:48:43] Everything's better with that. [23:49:31] Although I'm not sure if it was working with that before [23:51:29] ebernhardson, want to do https://gerrit.wikimedia.org/r/#/c/229424/1 ? [23:51:46] 6operations, 10Wikimedia-General-or-Unknown: Set up "w.wiki" domain for usage with UrlShortener - https://phabricator.wikimedia.org/T108649#1526212 (10Legoktm) 3NEW [23:51:53] Krenair: sure [23:51:59] (03CR) 10EBernhardson: [C: 032] Limit the number of states generated by a wildcard query [mediawiki-config] - 10https://gerrit.wikimedia.org/r/229424 (https://phabricator.wikimedia.org/T102589) (owner: 10DCausse) [23:52:03] 6operations, 10ops-codfw: EQDFW/EQORD Deployment Prep Task - https://phabricator.wikimedia.org/T91077#1526221 (10RobH) We've linked up with Arul @ EQ for all the above. The two wave xconnects are pending (per the links in the task description). These are the only two outstanding items needed before the on-si... 
[23:52:23] (03Merged) 10jenkins-bot: Limit the number of states generated by a wildcard query [mediawiki-config] - 10https://gerrit.wikimedia.org/r/229424 (https://phabricator.wikimedia.org/T102589) (owner: 10DCausse) [23:53:56] !log ebernhardson@tin Synchronized wmf-config/CirrusSearch-common.php: Limit the number of states in a cirrussearch query (duration: 00m 11s) [23:54:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:56:45] 6operations, 10Deployment-Systems, 10RESTBase, 6Services, 5Patch-For-Review: [Discussion] Move restbase config to Ansible (or $deploy_system in general)? - https://phabricator.wikimedia.org/T107532#1526241 (10GWicke) Yup: >> I understand that having a definitive answer on whether or not to use ansible... [23:57:43] 6operations, 6Analytics-Backlog, 10Analytics-EventLogging, 10Traffic: EventLogging query strings are truncated to 1014 bytes by ?(varnishncsa? or udp packet size?) - https://phabricator.wikimedia.org/T91347#1526242 (10Tgr) >>! In T91347#1526151, @awight wrote: > It seems like wasted effort to increase the... [23:58:43] (03PS4) 10Dzahn: Grafana: Apache 2.3+ compatibility [puppet] - 10https://gerrit.wikimedia.org/r/230682 (owner: 10Ori.livneh)