[00:01:17] operations, Beta-Cluster, Labs, Labs-Infrastructure: beta: Get SSL certificates for *.{projects}.beta.wmflabs.org - https://phabricator.wikimedia.org/T50501#1629101 (Dzahn)
[00:02:39] RECOVERY - puppet last run on rcs1002 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures
[00:06:57] operations, Wikimedia-Git-or-Gerrit: Wikimedia Gerrit doesn't work if OpenSSH version is higher than 7.0 - https://phabricator.wikimedia.org/T112025#1629114 (Dzahn) Does it really have to be a Gerrit upgrade and can't be a change to a puppetized config file as we would do with regular sshd?
[00:07:06] (PS1) Yuvipanda: k8s: Use insecure port for controller / scheduler [puppet] - https://gerrit.wikimedia.org/r/237543
[00:07:28] (PS2) Yuvipanda: k8s: Use insecure port for controller / scheduler [puppet] - https://gerrit.wikimedia.org/r/237543
[00:10:07] (CR) Yuvipanda: [C: 2] k8s: Use insecure port for controller / scheduler [puppet] - https://gerrit.wikimedia.org/r/237543 (owner: Yuvipanda)
[00:10:12] (CR) Alex Monk: "krenair@tin:/srv/mediawiki-staging (master)$ mwscript namespaceDupes.php pnbwiki | grep "وکیپیڈیا" -c" [mediawiki-config] - https://gerrit.wikimedia.org/r/226543 (owner: Amire80)
[00:26:08] operations, Wikimedia-Git-or-Gerrit: Wikimedia Gerrit doesn't work if OpenSSH version is higher than 7.0 - https://phabricator.wikimedia.org/T112025#1629179 (Dzahn) re: this suggested [[ https://code.google.com/p/gerrit/issues/detail?id=3517 | solution ]] to delete/replace [[ https://en.wikipedia.org/wiki...
[00:38:35] (PS2) Dzahn: wdqs: set icinga contact group on node [puppet] - https://gerrit.wikimedia.org/r/237535
[00:49:54] (PS1) Yuvipanda: k8s: Use https for kubelet and kube-proxy [puppet] - https://gerrit.wikimedia.org/r/237553
[00:50:13] (PS2) Yuvipanda: k8s: Use https for kubelet and kube-proxy [puppet] - https://gerrit.wikimedia.org/r/237553
[00:50:21] (CR) Yuvipanda: [C: 2 V: 2] k8s: Use https for kubelet and kube-proxy [puppet] - https://gerrit.wikimedia.org/r/237553 (owner: Yuvipanda)
[00:58:59] operations, Commons, MediaWiki-File-management, MediaWiki-Tarball-Backports, and 6 others: InstantCommons broken by switch to HTTPS - https://phabricator.wikimedia.org/T102566#1629234 (Tgr) Backports that still need merging (I tested them all): https://gerrit.wikimedia.org/r/#/q/Ibde59be61a5b3d7cd5...
[00:59:21] operations, Commons, MediaWiki-File-management, MediaWiki-Tarball-Backports, and 6 others: InstantCommons broken by switch to HTTPS - https://phabricator.wikimedia.org/T102566#1629237 (Tgr)
[00:59:32] operations, Commons, MediaWiki-File-management, MediaWiki-Tarball-Backports, and 6 others: InstantCommons broken by switch to HTTPS - https://phabricator.wikimedia.org/T102566#1377088 (Tgr)
[01:03:35] ostriches: can you merge the bunch of backports from https://phabricator.wikimedia.org/T102566#1629234 ? they are the last thing blocking Commons from being HTTPS-only
[01:03:53] well, they and a set of new releases
[01:10:11] (CR) Dzahn: [C: 2] wdqs: set icinga contact group on node [puppet] - https://gerrit.wikimedia.org/r/237535 (owner: Dzahn)
[01:10:28] (PS3) Dzahn: wdqs: set icinga contact group on node [puppet] - https://gerrit.wikimedia.org/r/237535
[01:16:25] !log ori@tin Synchronized php-1.26wmf22/extensions/TitleBlacklist: 9bf13dbe0b, 3203b045f7 (duration: 00m 12s)
[01:16:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[01:19:30] (PS1) Yuvipanda: k8s: Setup kubeconfig file for kubelet and kube-proxy [puppet] - https://gerrit.wikimedia.org/r/237559
[01:19:38] (CR) jenkins-bot: [V: -1] k8s: Setup kubeconfig file for kubelet and kube-proxy [puppet] - https://gerrit.wikimedia.org/r/237559 (owner: Yuvipanda)
[01:19:53] (PS2) Yuvipanda: k8s: Setup kubeconfig file for kubelet and kube-proxy [puppet] - https://gerrit.wikimedia.org/r/237559
[01:20:09] Do we have docs on server-side uploads somewhere?
[01:20:50] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 14 data above and 1 below the confidence bounds
[01:20:55] do you mean the upload-by-url feature?
[01:21:41] (CR) Yuvipanda: [C: 2] k8s: Setup kubeconfig file for kubelet and kube-proxy [puppet] - https://gerrit.wikimedia.org/r/237559 (owner: Yuvipanda)
[01:25:47] tgr, no, uploads by sysadmins
[01:29:54] there is https://wikitech.wikimedia.org/wiki/Uploading_large_files if that counts as documentation
[01:30:18] PROBLEM - check_disk on backup4001 is CRITICAL: DISK CRITICAL - free space: / 874112 MB (99% inode=99%): /dev 7991 MB (99% inode=99%): /run 1599 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 8000 MB (100% inode=99%): /archive 308159 MB (10% inode=99%)
[01:30:26] Aka 'yes, ish'. :-)
[01:32:13] tgr, no
[01:34:32] Krenair: "Upload to terbium then use that command".
[01:35:09] Yeah I would, but not enough space on terbium
[01:35:18] PROBLEM - check_disk on backup4001 is CRITICAL: DISK CRITICAL - free space: / 874112 MB (99% inode=99%): /dev 7991 MB (99% inode=99%): /run 1599 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 8000 MB (100% inode=99%): /archive 301725 MB (10% inode=99%)
[01:35:54] Requested files are in a 46GB archive, terbium only has 25GB free on /
[01:36:01] there are supposedly 16 files in there
[01:36:05] I could ask them to split it up
[01:36:46] Hm.. I thought we had a wildcard cert for toolserver.org?
[01:37:06] On Dutch Wikipedia I'm getting SSL errors when using the map widget
[01:37:06] https://nl.wikipedia.org/wiki/Amsterdam
[01:37:10] GET https://b.toolserver.org/tiles/osm-no-labels/12/2103/1346.png net::ERR_INSECURE_RESPONSE
[01:37:13] operations, Wikimedia-Git-or-Gerrit: Wikimedia Gerrit doesn't work if OpenSSH version is higher than 7.0 - https://phabricator.wikimedia.org/T112025#1629293 (Alexsh) Looks like git 2.5.1 starts to use OpenSSH 7.0. Just updated my git to latest in Win10 and XP, same result before I add KexAlgorithms to .ssh...
[01:38:06] and they redirect to HTTP instead of HTTPS
[01:40:18] PROBLEM - check_disk on backup4001 is CRITICAL: DISK CRITICAL - free space: / 874112 MB (99% inode=99%): /dev 7991 MB (99% inode=99%): /run 1599 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 8000 MB (100% inode=99%): /archive 300313 MB (10% inode=99%)
[01:41:29] Pinged https://phabricator.wikimedia.org/T103272
[01:45:18] PROBLEM - check_disk on backup4001 is CRITICAL: DISK CRITICAL - free space: / 874112 MB (99% inode=99%): /dev 7991 MB (99% inode=99%): /run 1599 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 8000 MB (100% inode=99%): /archive 298589 MB (10% inode=99%)
[01:50:08] PROBLEM - check_disk on backup4001 is CRITICAL: DISK CRITICAL - free space: / 874112 MB (99% inode=99%): /dev 7991 MB (99% inode=99%): /run 1599 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 8000 MB (100% inode=99%): /archive 294498 MB (10% inode=99%)
[01:52:15] mutante, around?
[01:55:08] PROBLEM - check_disk on backup4001 is CRITICAL: DISK CRITICAL - free space: / 874112 MB (99% inode=99%): /dev 7991 MB (99% inode=99%): /run 1599 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 8000 MB (100% inode=99%): /archive 298080 MB (10% inode=99%)
[02:00:09] PROBLEM - check_disk on backup4001 is CRITICAL: DISK CRITICAL - free space: / 874112 MB (99% inode=99%): /dev 7991 MB (99% inode=99%): /run 1599 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 8000 MB (100% inode=99%): /archive 294498 MB (10% inode=99%)
[02:04:30] PROBLEM - puppet last run on mc2006 is CRITICAL: CRITICAL: puppet fail
[02:05:09] PROBLEM - check_disk on backup4001 is CRITICAL: DISK CRITICAL - free space: / 874112 MB (99% inode=99%): /dev 7991 MB (99% inode=99%): /run 1599 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 8000 MB (100% inode=99%): /archive 291032 MB (10% inode=99%)
[02:07:58] PROBLEM - LVS HTTPS IPv6 on mobile-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: Connection timed out
[02:08:29] (PS1) Tim Landscheidt: nagios_common: Delay default evaluation of template() [puppet] - https://gerrit.wikimedia.org/r/237561 (https://phabricator.wikimedia.org/T111982)
[02:09:41] (PS2) Tim Landscheidt: nagios_common: Delay default evaluation of template() [puppet] - https://gerrit.wikimedia.org/r/237561 (https://phabricator.wikimedia.org/T111982)
[02:09:59] RECOVERY - LVS HTTPS IPv6 on mobile-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 10770 bytes in 0.137 second response time
[02:10:08] PROBLEM - check_disk on backup4001 is CRITICAL: DISK CRITICAL - free space: / 874112 MB (99% inode=99%): /dev 7991 MB (99% inode=99%): /run 1599 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 8000 MB (100% inode=99%): /archive 291544 MB (10% inode=99%)
[02:14:20] (CR) Tim Landscheidt: "I tested this for a shinken instance where it worked, and I tried to emulate the calling pattern for Icinga by including the class with no" [puppet] - https://gerrit.wikimedia.org/r/237561 (https://phabricator.wikimedia.org/T111982) (owner: Tim Landscheidt)
[02:15:09] PROBLEM - check_disk on backup4001 is CRITICAL: DISK CRITICAL - free space: / 874112 MB (99% inode=99%): /dev 7991 MB (99% inode=99%): /run 1599 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 8000 MB (100% inode=99%): /archive 291544 MB (10% inode=99%)
[02:20:09] PROBLEM - check_disk on backup4001 is CRITICAL: DISK CRITICAL - free space: / 874112 MB (99% inode=99%): /dev 7991 MB (99% inode=99%): /run 1599 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 8000 MB (100% inode=99%): /archive 291544 MB (10% inode=99%)
[02:20:54] (PS1) Yuvipanda: k8s: Use the puppet certificates for authentication [puppet] - https://gerrit.wikimedia.org/r/237562
[02:21:01] (CR) jenkins-bot: [V: -1] k8s: Use the puppet certificates for authentication [puppet] - https://gerrit.wikimedia.org/r/237562 (owner: Yuvipanda)
[02:21:07] (PS2) Yuvipanda: k8s: Use the puppet certificates for authentication [puppet] - https://gerrit.wikimedia.org/r/237562
[02:22:32] (CR) Yuvipanda: [C: 2] k8s: Use the puppet certificates for authentication [puppet] - https://gerrit.wikimedia.org/r/237562 (owner: Yuvipanda)
[02:23:29] Puppet, operations: Need to run postgresql::user twice to set the password - https://phabricator.wikimedia.org/T112228#1629359 (Tgr) NEW
[02:23:55] Puppet, operations: Need to run postgresql::user twice to set the password - https://phabricator.wikimedia.org/T112228#1629367 (Tgr)
[02:23:56] Blocked-on-Operations, Puppet, Reading-Infrastructure-Team, Sentry, and 2 others: Create basic puppet role for Sentry - https://phabricator.wikimedia.org/T84956#1629366 (Tgr)
[02:24:06] (PS1) Yuvipanda: k8s: Break dependency cycle [puppet] - https://gerrit.wikimedia.org/r/237563
[02:25:08] (CR) Yuvipanda: [C: 2] k8s: Break dependency cycle [puppet] - https://gerrit.wikimedia.org/r/237563 (owner: Yuvipanda)
[02:25:08] PROBLEM - check_disk on backup4001 is CRITICAL: DISK CRITICAL - free space: / 874112 MB (99% inode=99%): /dev 7991 MB (99% inode=99%): /run 1599 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 8000 MB (100% inode=99%): /archive 291544 MB (10% inode=99%)
[02:30:08] PROBLEM - check_disk on backup4001 is CRITICAL: DISK CRITICAL - free space: / 874112 MB (99% inode=99%): /dev 7991 MB (99% inode=99%): /run 1599 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 8000 MB (100% inode=99%): /archive 291544 MB (10% inode=99%)
[02:31:20] RECOVERY - puppet last run on mc2006 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[02:32:35] operations, Wikimedia-Mailing-lists, Patch-For-Review: fermium needs to have exim4-daemon-heavy installed, not -light - https://phabricator.wikimedia.org/T112229#1629376 (Dzahn) NEW a:Dzahn
[02:34:49] !log l10nupdate@tin Synchronized php-1.26wmf22/cache/l10n: l10nupdate for 1.26wmf22 (duration: 11m 18s)
[02:35:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:35:08] PROBLEM - check_disk on backup4001 is CRITICAL: DISK CRITICAL - free space: / 874112 MB (99% inode=99%): /dev 7991 MB (99% inode=99%): /run 1599 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 8000 MB (100% inode=99%): /archive 291544 MB (10% inode=99%)
[02:40:09] PROBLEM - check_disk on backup4001 is CRITICAL: DISK CRITICAL - free space: / 874112 MB (99% inode=99%): /dev 7991 MB (99% inode=99%): /run 1599 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 8000 MB (100% inode=99%): /archive 291544 MB (10% inode=99%)
[02:41:25] !log l10nupdate@tin LocalisationUpdate completed (1.26wmf22) at 2015-09-11 02:41:24+00:00
[02:41:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:45:08] PROBLEM - check_disk on backup4001 is CRITICAL: DISK CRITICAL - free space: / 874112 MB (99% inode=99%): /dev 7991 MB (99% inode=99%): /run 1599 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 8000 MB (100% inode=99%): /archive 291544 MB (10% inode=99%)
[02:50:09] PROBLEM - check_disk on backup4001 is CRITICAL: DISK CRITICAL - free space: / 874112 MB (99% inode=99%): /dev 7991 MB (99% inode=99%): /run 1599 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 8000 MB (100% inode=99%): /archive 291544 MB (10% inode=99%)
[02:51:31] (PS1) Gergő Tisza: Ensure correct order in postgresql::user [puppet] - https://gerrit.wikimedia.org/r/237565 (https://phabricator.wikimedia.org/T112228)
[02:54:40] (PS30) Gergő Tisza: Basic role for Sentry [puppet] - https://gerrit.wikimedia.org/r/199598 (https://phabricator.wikimedia.org/T84956) (owner: Gilles)
[02:55:09] PROBLEM - check_disk on backup4001 is CRITICAL: DISK CRITICAL - free space: / 874112 MB (99% inode=99%): /dev 7991 MB (99% inode=99%): /run 1599 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 8000 MB (100% inode=99%): /archive 291544 MB (10% inode=99%)
[03:00:08] PROBLEM - check_disk on backup4001 is CRITICAL: DISK CRITICAL - free space: / 874112 MB (99% inode=99%): /dev 7991 MB (99% inode=99%): /run 1599 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 8000 MB (100% inode=99%): /archive 291544 MB (10% inode=99%)
[03:00:25] (PS1) Yuvipanda: k8s: Explicitly specify the apiserver on the commandline [puppet] - https://gerrit.wikimedia.org/r/237566
[03:05:09] PROBLEM - check_disk on backup4001 is CRITICAL: DISK CRITICAL - free space: / 874112 MB (99% inode=99%): /dev 7991 MB (99% inode=99%): /run 1599 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 8000 MB (100% inode=99%): /archive 291544 MB (10% inode=99%)
[03:05:38] (CR) Yuvipanda: [C: 2] k8s: Explicitly specify the apiserver on the commandline [puppet] - https://gerrit.wikimedia.org/r/237566 (owner: Yuvipanda)
[03:10:08] PROBLEM - check_disk on backup4001 is CRITICAL: DISK CRITICAL - free space: / 874112 MB (99% inode=99%): /dev 7991 MB (99% inode=99%): /run 1599 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 8000 MB (100% inode=99%): /archive 291544 MB (10% inode=99%)
[03:15:08] PROBLEM - check_disk on backup4001 is CRITICAL: DISK CRITICAL - free space: / 874112 MB (99% inode=99%): /dev 7991 MB (99% inode=99%): /run 1599 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 8000 MB (100% inode=99%): /archive 291544 MB (10% inode=99%)
[03:16:08] RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK: OK: No anomaly detected
[03:20:09] PROBLEM - check_disk on backup4001 is CRITICAL: DISK CRITICAL - free space: / 874112 MB (99% inode=99%): /dev 7991 MB (99% inode=99%): /run 1599 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 8000 MB (100% inode=99%): /archive 291544 MB (10% inode=99%)
[03:25:09] PROBLEM - check_disk on backup4001 is CRITICAL: DISK CRITICAL - free space: / 874112 MB (99% inode=99%): /dev 7991 MB
(99% inode=99%): /run 1599 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 8000 MB (100% inode=99%): /archive 291544 MB (10% inode=99%)
[03:28:22] (PS1) Yuvipanda: ores: Turn on redis keepalive [puppet] - https://gerrit.wikimedia.org/r/237567
[03:28:30] akosiaris: ^ backup4001 blew up
[03:28:35] halfak: ^
[03:28:55] \o/
[03:28:59] * halfak crosses fingers
[03:29:11] halfak: now to see if that works. this lets us set it in one place only instead of on all clients
[03:29:17] (CR) Yuvipanda: [C: 2] ores: Turn on redis keepalive [puppet] - https://gerrit.wikimedia.org/r/237567 (owner: Yuvipanda)
[03:29:29] Worker 01?
[03:29:46] no, on the redis server
[03:29:50] Oh!
[03:29:50] the server does the keepalive now
[03:30:09] PROBLEM - check_disk on backup4001 is CRITICAL: DISK CRITICAL - free space: / 874112 MB (99% inode=99%): /dev 7991 MB (99% inode=99%): /run 1599 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 8000 MB (100% inode=99%): /archive 291544 MB (10% inode=99%)
[03:30:52] The post says that you "enable the OS to answer TCP keepalive packets"
[03:31:08] Can the server just tell the clients "you will answer my TCP keepalive packets now"
[03:31:11] ?
[03:31:19] YuviPanda, ^
[03:32:00] halfak: so the redis server sends the keepalive packets now
[03:32:13] Gotcha. So it wasn't before.
[03:32:21] halfak: https://github.com/redis/redis-rb/issues/258
[03:32:26] halfak: no, and that's a crazy default
[03:32:48] halfak: worker-01 is processing tasks again and didn't need a restart
[03:33:03] Woot!
[03:33:19] halfak: let's see if this sticks!
[03:33:26] halfak: wheeee, though. I hope it does stick :)
[03:33:35] +1 Thanks for the help.
[03:33:45] I'm going to head out for a bit. I'll check back in an hour or so to see if it's OK.
[03:33:51] Should I leave it if it goes down?
[03:33:54] halfak: ok! if it dies again please ping me
[03:33:56] kk
[03:33:58] Will do
[03:34:03] o/
[03:34:04] * YuviPanda buys halfak a beer
[03:34:15] Not yet. It has to work first.
[03:34:20] :)
[03:34:23] I suppose I had a good lead (hopefully)
[03:34:37] halfak: I'm still happy we got a bit more proof. we found a connection hanging, caught it red-handed...
[03:35:08] PROBLEM - check_disk on backup4001 is CRITICAL: DISK CRITICAL - free space: / 874112 MB (99% inode=99%): /dev 7991 MB (99% inode=99%): /run 1599 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 8000 MB (100% inode=99%): /archive 291544 MB (10% inode=99%)
[03:35:47] * halfak goes to set up his hammock for a cold night test.
[03:40:08] PROBLEM - check_disk on backup4001 is CRITICAL: DISK CRITICAL - free space: / 874112 MB (99% inode=99%): /dev 7991 MB (99% inode=99%): /run 1599 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 8000 MB (100% inode=99%): /archive 291544 MB (10% inode=99%)
[03:44:39] ACKNOWLEDGEMENT - check_disk on backup4001 is CRITICAL: DISK CRITICAL - free space: / 874112 MB (99% inode=99%): /dev 7991 MB (99% inode=99%): /run 1599 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 8000 MB (100% inode=99%): /archive 291544 MB (10% inode=99%): Yuvi Panda Acking until alex wakes up - The acknowledgement expires at: 2015-09-12 09:43:52.
[03:51:48] (PS1) Yuvipanda: k8s: Fix typo [puppet] - https://gerrit.wikimedia.org/r/237568
[03:53:12] (CR) Yuvipanda: [C: 2] k8s: Fix typo [puppet] - https://gerrit.wikimedia.org/r/237568 (owner: Yuvipanda)
[04:13:38] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 12 data above and 6 below the confidence bounds
[04:25:49] operations, Wikimedia-Git-or-Gerrit: Wikimedia Gerrit doesn't work if OpenSSH version is higher than 7.0 - https://phabricator.wikimedia.org/T112025#1629452 (MaxSem) >>! In T112025#1629179, @Dzahn wrote: > re: this suggested [[ https://code.google.com/p/gerrit/issues/detail?id=3517 | solution ]] to delete...
[04:30:53] (PS1) Yuvipanda: k8s: Setup ssl certificates properly [puppet] - https://gerrit.wikimedia.org/r/237573
[04:31:49] (CR) jenkins-bot: [V: -1] k8s: Setup ssl certificates properly [puppet] - https://gerrit.wikimedia.org/r/237573 (owner: Yuvipanda)
[04:32:25] (PS2) Yuvipanda: k8s: Setup ssl certificates properly [puppet] - https://gerrit.wikimedia.org/r/237573
[04:33:29] (CR) Yuvipanda: [C: 2] k8s: Setup ssl certificates properly [puppet] - https://gerrit.wikimedia.org/r/237573 (owner: Yuvipanda)
[04:34:09] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 12 data above and 8 below the confidence bounds
[04:34:52] (PS1) Yuvipanda: k8s: Fix typo [puppet] - https://gerrit.wikimedia.org/r/237574
[04:35:53] (CR) Yuvipanda: [C: 2] k8s: Fix typo [puppet] - https://gerrit.wikimedia.org/r/237574 (owner: Yuvipanda)
[04:36:00] PROBLEM - Host mw1085 is DOWN: PING CRITICAL - Packet loss = 100%
[04:36:42] YuviPanda: um, I got a shinken alert for integration at 5:55pm... :/
[04:36:51] i'll check 1085
[04:36:59] RECOVERY - Host mw1085 is UP: PING OK - Packet loss = 0%, RTA = 0.93 ms
[04:37:02] or not
[04:38:59] legoktm: oh yeah, puppet's broken on the shinken host
[04:39:01] there's a patch to fix it
[04:39:04] I should do that soon
[04:39:04] (PS1) Dzahn: admin: optimized yuvipanda resource [puppet] - https://gerrit.wikimedia.org/r/237575
[04:39:47] (CR) jenkins-bot: [V: -1] admin: optimized yuvipanda resource [puppet] - https://gerrit.wikimedia.org/r/237575 (owner: Dzahn)
[04:40:07] YuviPanda: so I'm only pretend removed from shinken?
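A note on the keepalive exchange at 03:28 to 03:34: TCP keepalive probes are answered by the peer's TCP stack, so the clients need no configuration once the server side enables them. The log does not show the content of change 237567; it presumably sets Redis's `tcp-keepalive` option, which is an assumption here. The Python sketch below only illustrates what enabling keepalive means at the socket level. The helper names and the 60-second idle value are hypothetical, not taken from the patch.

```python
import socket


def enable_tcp_keepalive(sock, idle_seconds=60):
    """Ask the OS to probe an idle TCP connection (illustrative helper)."""
    # SO_KEEPALIVE switches keepalive probing on for this socket.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # TCP_KEEPIDLE (idle time before the first probe) is not available on
    # every platform, so only set it where the constant exists.
    if hasattr(socket, "TCP_KEEPIDLE"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle_seconds)
    return sock


def keepalive_enabled(sock):
    """Report whether SO_KEEPALIVE is currently set on the socket."""
    return sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0
```

With probes enabled on one end, a connection whose peer has silently gone away is eventually reported as dead instead of hanging forever, which matches the "connection hanging" symptom described in the chat.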
[04:47:03] (PS1) Dzahn: mailman: move scripts to /usr/local/sbin/ [puppet] - https://gerrit.wikimedia.org/r/237577
[04:49:14] (PS2) Dzahn: mailman: move scripts to /usr/local/sbin/ [puppet] - https://gerrit.wikimedia.org/r/237577
[04:49:27] (CR) Dzahn: [C: 2] mailman: move scripts to /usr/local/sbin/ [puppet] - https://gerrit.wikimedia.org/r/237577 (owner: Dzahn)
[04:50:13] YuviPanda, looks like worker-01 is still processing tasks. *and* I used ORES to revert some vandalism. :)
[04:50:21] halfak: \o/
[04:50:23] great
[04:50:38] * halfak --> bed
[04:50:39] o/
[04:51:01] halfak: night
[04:51:44] (PS1) Yuvipanda: k8s: Switch to running everything as a kubernetes user [puppet] - https://gerrit.wikimedia.org/r/237578
[04:52:03] (PS2) Yuvipanda: k8s: Switch to running everything as a kubernetes user [puppet] - https://gerrit.wikimedia.org/r/237578
[04:53:16] (CR) Yuvipanda: [C: 2] k8s: Switch to running everything as a kubernetes user [puppet] - https://gerrit.wikimedia.org/r/237578 (owner: Yuvipanda)
[05:10:49] (PS1) Yuvipanda: k8s: Specify scheme for controller and scheduler [puppet] - https://gerrit.wikimedia.org/r/237580
[05:11:07] (PS2) Yuvipanda: k8s: Specify scheme for controller and scheduler [puppet] - https://gerrit.wikimedia.org/r/237580
[05:12:18] (CR) Yuvipanda: [C: 2] k8s: Specify scheme for controller and scheduler [puppet] - https://gerrit.wikimedia.org/r/237580 (owner: Yuvipanda)
[05:13:56] (CR) Dzahn: [C: -2] admin: optimized yuvipanda resource [puppet] - https://gerrit.wikimedia.org/r/237575 (owner: Dzahn)
[05:15:00] (CR) Yuvipanda: [C: 2] admin: optimized yuvipanda resource [puppet] - https://gerrit.wikimedia.org/r/237575 (owner: Dzahn)
[05:16:38] :p i'll show myself out :)
[05:19:10] mutante: :D
[05:19:23] (PS1) Yuvipanda: k8s: Setup /var/run for kubelet [puppet] - https://gerrit.wikimedia.org/r/237581
[05:19:26] (PS1) Yuvipanda: k8s: Run k8s-proxy as root [puppet] - https://gerrit.wikimedia.org/r/237582
[05:21:50] (CR) Yuvipanda: [C: 2] k8s: Setup /var/run for kubelet [puppet] - https://gerrit.wikimedia.org/r/237581 (owner: Yuvipanda)
[05:22:03] (CR) Yuvipanda: [C: 2] k8s: Run k8s-proxy as root [puppet] - https://gerrit.wikimedia.org/r/237582 (owner: Yuvipanda)
[05:25:27] (PS1) Yuvipanda: k8s: Add kubernetes user to the docker group [puppet] - https://gerrit.wikimedia.org/r/237583
[05:27:29] (CR) Yuvipanda: [C: 2] k8s: Add kubernetes user to the docker group [puppet] - https://gerrit.wikimedia.org/r/237583 (owner: Yuvipanda)
[05:27:59] mutante: hahaha
[05:28:13] y7a!
[05:29:34] *scnr*
[05:30:51] (PS1) Yuvipanda: k8s: Setup /var/lib/kubelet too [puppet] - https://gerrit.wikimedia.org/r/237584
[05:31:53] (CR) Yuvipanda: [C: 2] k8s: Setup /var/lib/kubelet too [puppet] - https://gerrit.wikimedia.org/r/237584 (owner: Yuvipanda)
[05:35:45] operations, Wikimedia-Git-or-Gerrit: Upgrade gerrit to latest 2.8.x (minor version upgrade) - https://phabricator.wikimedia.org/T65847#1629467 (Florian) I'm not sure if Differential is really the best code-review tool we could have, but this is another question.
[05:53:26] operations, Wikimedia-Git-or-Gerrit: Upgrade gerrit to latest 2.8.x (minor version upgrade) - https://phabricator.wikimedia.org/T65847#1629485 (Nemo_bis) >>! In T65847#1628121, @greg wrote: > Reducing priority as the energy spent on code-review tools in the near term (ie: for the next two quarters) will b...
[06:00:07] operations, Wikimedia-Git-or-Gerrit: Upgrade gerrit to latest 2.8.x (minor version upgrade) - https://phabricator.wikimedia.org/T65847#1629492 (Dzahn) I kind of expected that T112025 raised the priority a bit.
[06:03:01] !log l10nupdate@tin ResourceLoader cache refresh completed at Fri Sep 11 06:03:00 UTC 2015 (duration 2m 59s)
[06:03:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[06:30:10] PROBLEM - puppet last run on mc2007 is CRITICAL: CRITICAL: puppet fail
[06:30:48] PROBLEM - puppet last run on chromium is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:09] PROBLEM - puppet last run on mw1170 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:19] PROBLEM - puppet last run on mw1158 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:59] PROBLEM - puppet last run on mw2129 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:33:18] PROBLEM - puppet last run on mw2073 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:33:40] PROBLEM - puppet last run on mw2126 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:34:18] PROBLEM - puppet last run on mw2045 is CRITICAL: CRITICAL: Puppet has 3 failures
[06:56:39] RECOVERY - puppet last run on mw1170 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures
[06:56:49] RECOVERY - puppet last run on mw1158 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures
[06:56:50] RECOVERY - puppet last run on mc2007 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:10] RECOVERY - puppet last run on chromium is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:29] RECOVERY - puppet last run on mw2129 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures
[06:57:48] RECOVERY - puppet last run on mw2073 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:10] RECOVERY - puppet last run on mw2126 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:49] RECOVERY - puppet last run on mw2045 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[07:03:39] RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK: OK: No anomaly detected
[07:04:38] (PS1) Yuvipanda: k8s: Make kubelet run as root as well [puppet] - https://gerrit.wikimedia.org/r/237590
[07:06:44] operations, ContentTranslation-Deployments, MediaWiki-extensions-ContentTranslation, ContentTranslation-Release6, and 4 others: Review and create table for Content Translation - https://phabricator.wikimedia.org/T111317#1629567 (KartikMistry) Open>Resolved
[07:07:22] (CR) Yuvipanda: [C: 2] k8s: Make kubelet run as root as well [puppet] - https://gerrit.wikimedia.org/r/237590 (owner: Yuvipanda)
[07:24:19] operations, Phabricator: phabricator dump script should use slave db, not master - https://phabricator.wikimedia.org/T112193#1629606 (jcrespo) p:Triage>Normal So, a couple of clarifications: * The dump process is relatively expensive, it took today 2 hours from 2 to 4 am UTC. It doesn't seem to be c...
[07:24:41] operations, Phabricator, Database: phabricator dump script should use slave db, not master - https://phabricator.wikimedia.org/T112193#1629608 (jcrespo)
[07:40:49] (PS4) Tobias Gritschacher: phragile: Add role class [puppet] - https://gerrit.wikimedia.org/r/227466 (https://phabricator.wikimedia.org/T108803) (owner: WMDE-leszek)
[08:07:23] operations: rsyncd restart unreliable after configuration changes - https://phabricator.wikimedia.org/T112240#1629689 (MoritzMuehlenhoff) NEW
[08:39:59] RECOVERY - HTTP on dataset1001 is OK: HTTP OK: HTTP/1.1 200 OK - 5730 bytes in 7.823 second response time
[08:42:39] operations, ops-codfw: ms-be2006.codfw.wmnet: slot=6 dev=sdg failed - https://phabricator.wikimedia.org/T112242#1629759 (fgiunchedi) NEW
[08:46:29] operations: rsyncd restart unreliable after configuration changes - https://phabricator.wikimedia.org/T112240#1629767 (jcrespo) p:Triage>Low Thank you, @MoritzMuehlenhoff for the investigation. Despite personal interest, I am, however, going to triage this as low, because despite being a defect a) there...
[08:47:11] operations: Ferm rules for postgres roles / labsdb - https://phabricator.wikimedia.org/T104960#1629770 (MoritzMuehlenhoff) Open>Resolved a:MoritzMuehlenhoff Ferm has been enabled on the labsdb hosts last week.
[08:48:50] ACKNOWLEDGEMENT - RAID on ms-be2006 is CRITICAL: CRITICAL: 1 failed LD(s) (Offline) Filippo Giunchedi T112242
[08:48:58] ACKNOWLEDGEMENT - puppet last run on ms-be2006 is CRITICAL: CRITICAL: Puppet has 1 failures Filippo Giunchedi T112242
[08:50:47] (CR) Filippo Giunchedi: cassandra: install certs and CA from private.git (1 comment) [puppet] - https://gerrit.wikimedia.org/r/237397 (https://phabricator.wikimedia.org/T108953) (owner: Filippo Giunchedi)
[08:53:56] (PS2) Filippo Giunchedi: cassandra: new class ca_manager [puppet] - https://gerrit.wikimedia.org/r/237377 (https://phabricator.wikimedia.org/T108953)
[08:55:28] (CR) Filippo Giunchedi: [C: 2 V: 2] "@Marko, I tend to agree on the class name but afaik underscores are preferred to dashes in puppet names" [puppet] - https://gerrit.wikimedia.org/r/237377 (https://phabricator.wikimedia.org/T108953) (owner: Filippo Giunchedi)
[08:56:21] (PS3) Filippo Giunchedi: cassandra: install certs and CA from private.git [puppet] - https://gerrit.wikimedia.org/r/237397 (https://phabricator.wikimedia.org/T108953)
[08:56:33] (CR) Filippo Giunchedi: [C: 2 V: 2] cassandra: install certs and CA from private.git [puppet] - https://gerrit.wikimedia.org/r/237397 (https://phabricator.wikimedia.org/T108953) (owner: Filippo Giunchedi)
[09:07:26] greg-g: backup4001 is FR as it turns out. Badly named host
[09:15:29] (PS1) Hashar: package_builder: support distribution name aliases [puppet] - https://gerrit.wikimedia.org/r/237604 (https://phabricator.wikimedia.org/T111097)
[09:17:17] akosiaris: if you got some spare time to add package_builder support for 'unstable' (in addition to 'sid'). Got a patch at https://gerrit.wikimedia.org/r/#/c/237604/ :-D
[09:17:26] akosiaris: using a lame symlink between 'unstable' and 'sid'
[09:20:11] hashar: looks ok. I suppose you tested it ?
[09:20:38] akosiaris: I did not, good point
[09:20:46] let me cherry pick that on the labs puppetmaster
[09:21:16] !log starting profiling of phabricator db (db1043). Very low overhead.
[09:21:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:27:09] (CR) Hashar: [C: 1 V: 2] "I have cherry picked it on the integration puppet master:" [puppet] - https://gerrit.wikimedia.org/r/237604 (https://phabricator.wikimedia.org/T111097) (owner: Hashar)
[09:27:19] akosiaris: it works :-D
[09:28:22] operations, Phabricator, Database, Patch-For-Review: Phabricator creates MySQL connection spikes - https://phabricator.wikimedia.org/T109279#1629836 (jcrespo) I've restarted profiling on the host.
[09:31:02] hashar: ok [09:31:36] akosiaris: the aim is to have the job detect the target distribution to build against using 'dpkg-parsechangelog' [09:31:40] and seems lot of packages have 'unstable' [09:31:58] (CR) Alexandros Kosiaris: [C: 2 V: 2] package_builder: support distribution name aliases [puppet] - https://gerrit.wikimedia.org/r/237604 (https://phabricator.wikimedia.org/T111097) (owner: Hashar) [09:32:04] hashar: ok merging [09:32:10] (PS2) Alexandros Kosiaris: package_builder: support distribution name aliases [puppet] - https://gerrit.wikimedia.org/r/237604 (https://phabricator.wikimedia.org/T111097) (owner: Hashar) [09:32:15] (CR) Alexandros Kosiaris: [V: 2] package_builder: support distribution name aliases [puppet] - https://gerrit.wikimedia.org/r/237604 (https://phabricator.wikimedia.org/T111097) (owner: Hashar) [09:32:29] building .deb packages via Jenkins is a pet project of mine [09:32:57] but eventually when folks propose a change to some repo under operations/debs/ , they would end up with a .deb and lintian/piupart reports :} [09:36:39] Hey jynus! around for a quick PM? :) [09:37:35] addshore, sure [09:48:18] (PS2) Hashar: nodepool: sudo rules for contint-admins [puppet] - https://gerrit.wikimedia.org/r/235742 (https://phabricator.wikimedia.org/T111374) [09:57:22] Puppet, Continuous-Integration-Config, Patch-For-Review: Setup rubocop for operations/puppet ruby code lints - https://phabricator.wikimedia.org/T102020#1629871 (zeljkofilipin) stalled>Open [09:59:42] (PS3) Zfilipin: WIP rubocop: do not run for upstream code [puppet] - https://gerrit.wikimedia.org/r/235695 (https://phabricator.wikimedia.org/T102020) [10:00:55] I could use a nodepool start please. We shut it off while labs was being upgraded / designate issue.
Should be: ssh root@labnodepool1001.eqiad.wmnet /bin/systemctl start nodepool [10:01:23] hashar, I am duty, just ping me [10:01:37] ah I forgot about the duty thingie :-D [10:02:11] if people do not use it doesn't work :-( (I am only saying this because it is friday) [10:02:32] yeah I should look at the topic [10:02:45] BTW, ssh root@? really? [10:02:57] oh I have no idea how you guys connect to machine nowadays [10:03:02] a sure thing, systemctl needs root :/ [10:03:11] "this kids nowadays" [10:03:30] !log starting nodepool in labnodepool1001 [10:03:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:03:58] active (running) since Fri 2015-09-11 10:03:42 UTC [10:04:03] (PS1) Muehlenhoff: Enable ferm on snapshot1001 [puppet] - https://gerrit.wikimedia.org/r/237615 (https://phabricator.wikimedia.org/T104991) [10:04:17] you know I am joking, do you? [10:04:33] I tend to get misundertand in my tone :-) [10:04:39] jynus: Gracie mille [10:04:41] sorry about that [10:04:53] ahha [10:05:11] sorry I picks things literally.
Hard to guess over IRC whether there is a joke going on :} [10:05:19] my fault, usually [10:06:32] (PS4) Zfilipin: rubocop: do not run for upstream code [puppet] - https://gerrit.wikimedia.org/r/235695 (https://phabricator.wikimedia.org/T102020) [10:07:11] hashar, hopfully you could get sudo for that soon [10:07:26] yeah [10:07:27] * grazie mille [10:08:08] (CR) Zfilipin: "I have updated rubocop configuration file according to https://phabricator.wikimedia.org/T102020#1625716" [puppet] - https://gerrit.wikimedia.org/r/235695 (https://phabricator.wikimedia.org/T102020) (owner: Zfilipin) [10:08:18] (CR) Zfilipin: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/235695 (https://phabricator.wikimedia.org/T102020) (owner: Zfilipin) [10:09:39] (CR) Zfilipin: "RuboCop is happy: https://integration.wikimedia.org/ci/job/bundle-rubocop/1020/console" [puppet] - https://gerrit.wikimedia.org/r/235695 (https://phabricator.wikimedia.org/T102020) (owner: Zfilipin) [10:11:26] Puppet, Continuous-Integration-Config, Patch-For-Review: Setup rubocop for operations/puppet ruby code lints - https://phabricator.wikimedia.org/T102020#1629919 (zeljkofilipin) @akosiaris: Thanks! :) The only thing left to do is reviewing and merging the related commit: [[ https://gerrit.wikimedia.org... [10:17:49] (CR) Hashar: "Can you update the comments? Pointing to Phabricator comments is not that useful :-} Beside that looks fine." (3 comments) [puppet] - https://gerrit.wikimedia.org/r/235695 (https://phabricator.wikimedia.org/T102020) (owner: Zfilipin) [10:18:59] (CR) Hashar: [C: -1] "From PS3 to PS4 you removed the git submodules.
Please keep them ignored since on our local machines we most probably have them checked o" [puppet] - https://gerrit.wikimedia.org/r/235695 (https://phabricator.wikimedia.org/T102020) (owner: Zfilipin) [10:26:00] (CR) Jcrespo: [C: -1] "unless => "SELECT usename FROM pg_shadow WHERE usename='${username}' and passwd='${pwd_hash_sql}'" ?" [puppet] - https://gerrit.wikimedia.org/r/237565 (https://phabricator.wikimedia.org/T112228) (owner: Gergő Tisza) [10:28:51] (CR) Jcrespo: "Let's wait until Monday, then reevaluate, please." [puppet] - https://gerrit.wikimedia.org/r/237513 (https://phabricator.wikimedia.org/T112135) (owner: Dzahn) [10:41:56] operations, Wikimedia-Git-or-Gerrit: Wikimedia Gerrit doesn't work if OpenSSH version is higher than 7.0 - https://phabricator.wikimedia.org/T112025#1629948 (hashar) @Cblair91 complained about this issue on #wikimedia-labs [11:02:11] (CR) Zfilipin: rubocop: do not run for upstream code (3 comments) [puppet] - https://gerrit.wikimedia.org/r/235695 (https://phabricator.wikimedia.org/T102020) (owner: Zfilipin) [11:03:37] (PS5) Zfilipin: rubocop: do not run for upstream code [puppet] - https://gerrit.wikimedia.org/r/235695 (https://phabricator.wikimedia.org/T102020) [11:09:31] (PS1) ArielGlenn: fixes for cert cleaner script for labs [puppet] - https://gerrit.wikimedia.org/r/237626 [11:11:05] operations, Labs, Salt: salt does not run reliably for toollabs / labs generally - https://phabricator.wikimedia.org/T99213#1630013 (ArielGlenn) so the reason that keys don't get deleted from salt via this script when the instance is deleted is that (some of) them stay around in ldap. Is that intentio...
[12:10:39] PROBLEM - puppet last run on db2052 is CRITICAL: CRITICAL: puppet fail [12:35:58] RECOVERY - puppet last run on db2052 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures [12:50:09] PROBLEM - check_puppetrun on boron is CRITICAL: CRITICAL: Puppet has 1 failures [12:50:09] PROBLEM - check_puppetrun on heka is CRITICAL: CRITICAL: Puppet has 1 failures [12:54:58] operations, RESTBase, RESTBase-Cassandra: rename cassandra test cluster - https://phabricator.wikimedia.org/T112257#1630273 (fgiunchedi) NEW a:fgiunchedi [12:55:08] PROBLEM - check_puppetrun on boron is CRITICAL: CRITICAL: Puppet has 1 failures [12:55:09] PROBLEM - check_puppetrun on heka is CRITICAL: CRITICAL: Puppet has 1 failures [13:00:00] (PS1) Muehlenhoff: Add command list-server-group-members [debs/debdeploy] - https://gerrit.wikimedia.org/r/237641 [13:00:08] PROBLEM - check_puppetrun on heka is CRITICAL: CRITICAL: Puppet has 1 failures [13:00:09] PROBLEM - check_puppetrun on boron is CRITICAL: CRITICAL: Puppet has 1 failures [13:01:02] (PS1) Filippo Giunchedi: cassandra: adjust test cluster name [puppet] - https://gerrit.wikimedia.org/r/237643 (https://phabricator.wikimedia.org/T112257) [13:02:02] (CR) Muehlenhoff: [C: 2 V: 2] Add command list-server-group-members [debs/debdeploy] - https://gerrit.wikimedia.org/r/237641 (owner: Muehlenhoff) [13:02:05] (CR) Filippo Giunchedi: [C: -2] "not for today" [puppet] - https://gerrit.wikimedia.org/r/237643 (https://phabricator.wikimedia.org/T112257) (owner: Filippo Giunchedi) [13:05:09] PROBLEM - check_puppetrun on boron is CRITICAL: CRITICAL: Puppet has 1 failures [13:05:09] PROBLEM - check_puppetrun on heka is CRITICAL: CRITICAL: Puppet has 1 failures [13:06:50] operations, RESTBase, RESTBase-Cassandra, Patch-For-Review: rename cassandra test cluster - https://phabricator.wikimedia.org/T112257#1630315 (mobrovac) I suppose updating the `system.local` table will happen
**before** applying the patch? [13:06:52] (PS2) Nemo bis: Add some more redis monitoring metrics to ganglia [puppet] - https://gerrit.wikimedia.org/r/225292 [13:10:08] RECOVERY - check_puppetrun on boron is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures [13:10:09] PROBLEM - check_puppetrun on heka is CRITICAL: CRITICAL: Puppet has 1 failures [13:15:08] RECOVERY - check_puppetrun on heka is OK: OK: Puppet is currently enabled, last run 188 seconds ago with 0 failures [13:27:53] (CR) BBlack: "Yes, but:" [puppet] - https://gerrit.wikimedia.org/r/237368 (owner: BBlack) [13:37:41] operations, Commons, MediaWiki-File-management, MediaWiki-Tarball-Backports, and 6 others: InstantCommons broken by switch to HTTPS - https://phabricator.wikimedia.org/T102566#1630396 (BBlack) So, now we're pending on merge of those 3 and a new sec release of those versions? [13:41:05] operations, RESTBase, RESTBase-Cassandra, Patch-For-Review: rename cassandra test cluster - https://phabricator.wikimedia.org/T112257#1630404 (Eevans) >>! In T112257#1630315, @mobrovac wrote: > I suppose updating the `system.local` table will happen **before** applying the patch? Based on a cursor...
[13:41:34] (Abandoned) BBlack: HTTP/2 alpha patch v2 [software/nginx] (wmf-1.9.3-1-h2) - https://gerrit.wikimedia.org/r/230040 (https://phabricator.wikimedia.org/T96848) (owner: BBlack) [13:44:29] (CR) Eevans: [C: 1] "When the time comes...LGTM" [puppet] - https://gerrit.wikimedia.org/r/237643 (https://phabricator.wikimedia.org/T112257) (owner: Filippo Giunchedi) [13:46:13] (PS1) BBlack: HTTP/2 Alpha Patch [software/nginx] (wmf-1.9.4-1-h2) - https://gerrit.wikimedia.org/r/237646 [13:46:29] PROBLEM - Host mw2031 is DOWN: PING CRITICAL - Packet loss = 100% [13:47:39] RECOVERY - Host mw2031 is UP: PING OK - Packet loss = 0%, RTA = 35.48 ms [13:48:44] (PS1) Filippo Giunchedi: cassandra: enable DC internode encryption for test cluster [puppet] - https://gerrit.wikimedia.org/r/237648 (https://phabricator.wikimedia.org/T108953) [13:49:09] (CR) Filippo Giunchedi: [C: -2] "not for today" [puppet] - https://gerrit.wikimedia.org/r/237648 (https://phabricator.wikimedia.org/T108953) (owner: Filippo Giunchedi) [13:56:09] RECOVERY - check_disk on backup4001 is OK: DISK OK - free space: / 874114 MB (99% inode=99%): /dev 7991 MB (99% inode=99%): /run 1599 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 8000 MB (100% inode=99%): /archive 843746 MB (29% inode=99%) [14:05:48] PROBLEM - DPKG on mc2007 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [14:06:10] PROBLEM - DPKG on mc2009 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [14:06:19] PROBLEM - DPKG on mc2013 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [14:06:21] PROBLEM - DPKG on mc2010 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [14:06:21] PROBLEM - DPKG on mc2012 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [14:06:28] PROBLEM - DPKG on mc2005 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [14:06:38] PROBLEM - DPKG on mc2015 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [14:06:59] PROBLEM - DPKG on mc2014
is CRITICAL: DPKG CRITICAL dpkg reports broken packages [14:07:08] PROBLEM - DPKG on mc2002 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [14:07:09] PROBLEM - DPKG on mc2008 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [14:07:09] PROBLEM - DPKG on mc2006 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [14:07:09] PROBLEM - DPKG on mc2011 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [14:07:09] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL: CRITICAL: 41.67% of data above the critical threshold [500.0] [14:07:28] PROBLEM - DPKG on mc2001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [14:07:28] PROBLEM - DPKG on mc2016 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [14:07:39] PROBLEM - DPKG on mc2003 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [14:07:49] PROBLEM - DPKG on mc2004 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [14:08:45] ^looking into the mc2 reports [14:10:29] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 12 data above and 1 below the confidence bounds [14:10:50] RECOVERY - DPKG on mc2015 is OK: All packages OK [14:10:51] operations, Graphite: grafana access control - https://phabricator.wikimedia.org/T108546#1630516 (Ottomata) Can we open grafana up to all WMF employees for now then? Analytics team members that are used to using it are now restricted. [14:12:29] RECOVERY - DPKG on mc2013 is OK: All packages OK [14:12:38] what's with the 5xx spike combined with mc probs? [14:12:47] unrelated?
[14:13:18] RECOVERY - DPKG on mc2014 is OK: All packages OK [14:13:20] RECOVERY - DPKG on mc2011 is OK: All packages OK [14:13:42] yeah, unrelated [14:13:45] memcache monitoring seems ok [14:14:06] no typical suspects (nutcracker/mc failing) [14:14:17] I found a bug in debdeploy with some postinsts (mc2* is one of the test bed systems) [14:14:25] ah [14:14:31] ah, and it is mc2 [14:14:37] so definitily unrelated [14:14:39] RECOVERY - DPKG on mc2009 is OK: All packages OK [14:14:40] RECOVERY - DPKG on mc2010 is OK: All packages OK [14:14:48] will check logs anyway [14:14:48] RECOVERY - DPKG on mc2012 is OK: All packages OK [14:15:29] RECOVERY - DPKG on mc2008 is OK: All packages OK [14:15:29] RECOVERY - DPKG on mc2006 is OK: All packages OK [14:15:46] ottomata: unless I'm mistake granfa is open to ops,wmf,nda in LDAP. [14:16:00] RECOVERY - DPKG on mc2003 is OK: All packages OK [14:16:09] RECOVERY - DPKG on mc2004 is OK: All packages OK [14:16:10] RECOVERY - DPKG on mc2007 is OK: All packages OK [14:16:49] RECOVERY - DPKG on mc2005 is OK: All packages OK [14:17:38] RECOVERY - DPKG on mc2002 is OK: All packages OK [14:17:50] RECOVERY - DPKG on mc2001 is OK: All packages OK [14:21:29] (CR) Mobrovac: Add config deployment (1 comment) [tools/scap] - https://gerrit.wikimedia.org/r/235385 (owner: Thcipriani) [14:23:58] RECOVERY - HTTP 5xx req/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [14:26:47] (PS1) Andrew Bogott: Labvirt1002 to Kilo [puppet] - https://gerrit.wikimedia.org/r/237653 [14:27:05] JohnFLewis: Ah, thank you! [14:27:15] it is true, i think there was just some misunderstanding [14:27:54] operations, Graphite: grafana access control - https://phabricator.wikimedia.org/T108546#1630541 (Ottomata) Oh sorry, apparently it is. I was being asked by some analytics folks, and I think there was just a misunderstanding. Carry on! [14:29:26] (CR) Muehlenhoff: [C: -1] "This needs some more ports."
[puppet] - https://gerrit.wikimedia.org/r/237335 (owner: Muehlenhoff) [14:29:29] (CR) Andrew Bogott: [C: 2 V: 2] Labvirt1002 to Kilo [puppet] - https://gerrit.wikimedia.org/r/237653 (owner: Andrew Bogott) [14:41:00] (PS1) Ottomata: Add alias for otto hproxy [puppet] - https://gerrit.wikimedia.org/r/237655 [14:51:58] (CR) Ottomata: [C: 2] Add alias for otto hproxy [puppet] - https://gerrit.wikimedia.org/r/237655 (owner: Ottomata) [14:52:16] (PS1) Muehlenhoff: Remove debdeploy postinsts [debs/debdeploy] - https://gerrit.wikimedia.org/r/237657 [14:53:27] (CR) Muehlenhoff: [C: 2 V: 2] Remove debdeploy postinsts [debs/debdeploy] - https://gerrit.wikimedia.org/r/237657 (owner: Muehlenhoff) [14:58:30] (PS1) Dzahn: fermium: back to regular role::lists [puppet] - https://gerrit.wikimedia.org/r/237658 [14:59:29] PROBLEM - Host mw1156 is DOWN: PING CRITICAL - Packet loss = 100% [14:59:51] (PS3) Andrew Bogott: nodepool: easily switch to nodepool user [puppet] - https://gerrit.wikimedia.org/r/234483 (owner: Hashar) [15:00:53] andrewbogott: jynus: could use another nodepool version dump. Not sure whom of you two to annoy about it :} [15:00:54] (CR) Andrew Bogott: [C: 2] nodepool: easily switch to nodepool user [puppet] - https://gerrit.wikimedia.org/r/234483 (owner: Hashar) [15:01:06] the package is ready "just" need upload to apt + upgrade on labnodepool [15:01:59] (PS2) Dzahn: fermium: back to regular role::lists [puppet] - https://gerrit.wikimedia.org/r/237658 [15:02:53] operations, ops-codfw: Humidity Alarms from codfw - https://phabricator.wikimedia.org/T110421#1630630 (faidon) p:Triage>Normal We're still getting these. Have these been investigated at all? [15:05:25] (CR) Dzahn: [C: 2] fermium: back to regular role::lists [puppet] - https://gerrit.wikimedia.org/r/237658 (owner: Dzahn) [15:07:05] (CR) Andrew Bogott: "needs manual rebase."
[puppet] - https://gerrit.wikimedia.org/r/220308 (https://phabricator.wikimedia.org/T103600) (owner: Hashar) [15:07:30] hashar: I can do it… link me to the package? [15:07:57] andrewbogott: https://phabricator.wikimedia.org/T112100 terbium.eqiad.wmnet:/home/hashar/public_html/debs/nodepool-debian-user/ [15:07:57] https://people.wikimedia.org/~hashar/debs/nodepool-debian-user/ [15:08:06] ah what a mess, too many links [15:08:08] terbium.eqiad.wmnet:/home/hashar/public_html/debs/nodepool-debian-user/ [15:08:14] should be for jessie-wikimedia/thirdparty [15:08:46] so... [15:09:05] since I’m confused by all the links, could you just give me a complete path to the actual file so I can scp? [15:11:00] !log swapping pem2 cr2-eqiad [15:11:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:11:11] paravoid ^ [15:11:32] awesome [15:12:00] juniper shipped it very fast. the email came in late last night [15:12:23] we may have 4hr for this one [15:13:11] hashar: ^ ? [15:13:37] andrewbogott: yeah sorry. Scp link: terbium.eqiad.wmnet:/home/hashar/public_html/debs/nodepool-debian-user/* [15:13:43] that has the tarball .deb etc [15:15:53] cmjohnson1: seems good [15:16:37] good [15:18:00] greg-g: can I deploy some echo regression fixes in an hour or so? [15:21:57] legoktm: suuuuure [15:24:10] operations, Wikimedia-Git-or-Gerrit: Upgrade gerrit to latest 2.8.x (minor version upgrade) - https://phabricator.wikimedia.org/T65847#1630703 (greg) >>! In T65847#1629492, @Dzahn wrote: > I kind of expected that T112025 raised the priority a bit. If we have to, we have to. I would prefer a solution that...
[15:24:26] (PS1) BBlack: Don't disable LRO/GRO on jessie LVS hosts [puppet] - https://gerrit.wikimedia.org/r/237667 (https://phabricator.wikimedia.org/T110530) [15:24:38] operations, Wikimedia-Git-or-Gerrit: Upgrade gerrit to latest 2.8.x (minor version upgrade) - https://phabricator.wikimedia.org/T65847#1630711 (greg) p:Low>Normal Normal prio until we figure out the ssh issue. [15:26:54] (CR) BBlack: [C: 2] Don't disable LRO/GRO on jessie LVS hosts [puppet] - https://gerrit.wikimedia.org/r/237667 (https://phabricator.wikimedia.org/T110530) (owner: BBlack) [15:35:20] PROBLEM - mailman_ctl on fermium is CRITICAL: PROCS CRITICAL: 0 processes with UID = 38 (list), regex args /mailman/bin/mailmanctl [15:35:40] PROBLEM - mailman_qrunner on fermium is CRITICAL: PROCS CRITICAL: 0 processes with UID = 38 (list), regex args /mailman/bin/qrunner [15:37:21] PROBLEM - Exim SMTP on fermium is CRITICAL: Connection refused [15:37:40] PROBLEM - HTTPS on fermium is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:IO::Socket::SSL: connect: Connection refused [15:38:22] PROBLEM - mailman archives on fermium is CRITICAL: Connection refused [15:38:40] PROBLEM - mailman list info on fermium is CRITICAL: Connection refused [15:39:48] (PS1) Jgreen: create new icinga contact group fr-tech-ops [puppet] - https://gerrit.wikimedia.org/r/237678 [15:40:11] operations: rsyncd restart unreliable after configuration changes - https://phabricator.wikimedia.org/T112240#1630783 (bd808) For the MW servers, I don't know how far away "upgrade to jessie" is. I don't think we have tested any of the Apache+HHVM+media conversion+image scaling stack on Jessie yet. This was m... [15:42:22] RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK: OK: No anomaly detected [15:44:59] !log enabled LRO+GRO on lvs200[456] (backups). Stopping pybal on lvs200[123] to test...
[15:45:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:45:24] (PS2) Jgreen: create new icinga contact group fr-tech-ops [puppet] - https://gerrit.wikimedia.org/r/237678 [15:47:48] (CR) Jgreen: [C: 2 V: 1] create new icinga contact group fr-tech-ops [puppet] - https://gerrit.wikimedia.org/r/237678 (owner: Jgreen) [15:48:06] andrewbogott: nodepool package is around. Would need an upgrade now ssh labnodepool1001.eqiad.wmnet apt-get install nodepool [15:49:08] oh yeah, I forgot you couldn’t do that :) [15:49:15] done [15:49:52] PROBLEM - pybal on lvs2002 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal [15:50:21] PROBLEM - pybal on lvs2003 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal [15:50:44] oops, that's me, ignore it ^ [15:51:31] PROBLEM - pybal on lvs2001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal [15:52:42] ACKNOWLEDGEMENT - pybal on lvs2001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal Brandon Black testing [15:52:42] ACKNOWLEDGEMENT - pybal on lvs2002 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal Brandon Black testing [15:52:42] ACKNOWLEDGEMENT - pybal on lvs2003 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal Brandon Black testing [15:53:02] Puppet, operations, Patch-For-Review: Need to run postgresql::user twice to set the password - https://phabricator.wikimedia.org/T112228#1630819 (jcrespo) a:jcrespo [15:53:35] operations, Fundraising Tech Backlog: Document FR-Tech hosts on wikitech - https://phabricator.wikimedia.org/T112278#1630820 (greg) NEW a:Jgreen [15:54:23] Puppet, operations, Patch-For-Review: Need to run postgresql::user twice to set the password - https://phabricator.wikimedia.org/T112228#1629359 (jcrespo) p:Triage>Low Low because it is not breaking
anything right now until a new deployment. I have written a suggestion of fix avoid the double appl... [15:56:03] operations, Beta-Cluster, Traffic, Patch-For-Review: Beta giving Error: 403, Insecure POST Forbidden - https://phabricator.wikimedia.org/T112195#1630844 (jcrespo) p:Triage>Normal Normal as the blocking task- there is no consensus about the right solution. [15:59:42] operations, ops-codfw: ms-be2006.codfw.wmnet: slot=6 dev=sdg failed - https://phabricator.wikimedia.org/T112242#1630848 (jcrespo) @fgiunchedi, trying to do the triage. I do not see the output of the RAID monitoring in the task summary. I suppose you confirmed that it is a hardware issue and not a kernel/f... [16:02:20] PROBLEM - puppet last run on bast4001 is CRITICAL: CRITICAL: puppet fail [16:02:33] (PS1) Legoktm: Get rid of $wmg hack for MassMessage settings [mediawiki-config] - https://gerrit.wikimedia.org/r/237686 [16:02:35] (PS1) Legoktm: Add $wgMassMessageWikiAliases configuration [mediawiki-config] - https://gerrit.wikimedia.org/r/237687 [16:03:25] operations, Beta-Cluster, Traffic, Patch-For-Review: Beta giving Error: 403, Insecure POST Forbidden - https://phabricator.wikimedia.org/T112195#1630869 (BBlack) [16:03:28] operations, Beta-Cluster, Labs, Labs-Infrastructure: beta: Get SSL certificates for *.{projects}.beta.wmflabs.org - https://phabricator.wikimedia.org/T50501#1630870 (BBlack) [16:03:59] operations, Beta-Cluster, Traffic, Patch-For-Review: Beta giving Error: 403, Insecure POST Forbidden - https://phabricator.wikimedia.org/T112195#1628536 (BBlack) The SSL cert issue is complex, we shouldn't block on this to fix beta here. Something like Alex's local patch is warranted for now, but...
[16:04:52] (PS1) Ottomata: Set replace=True for EventLogging MySQL consumer [puppet] - https://gerrit.wikimedia.org/r/237688 (https://phabricator.wikimedia.org/T112265) [16:07:45] !log enabled LRO+GRO on lvs200[123], starting pybal there again ([456] testing looks good so far) [16:07:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:08:22] ottomata: https://phabricator.wikimedia.org/T100678 ? [16:08:45] operations, ops-eqiad: mw1031 has a bad uplink - https://phabricator.wikimedia.org/T95896#1630894 (faidon) Ping? [16:09:18] RECOVERY - pybal on lvs2003 is OK: PROCS OK: 1 process with UID = 0 (root), args /usr/sbin/pybal [16:09:29] RECOVERY - pybal on lvs2002 is OK: PROCS OK: 1 process with UID = 0 (root), args /usr/sbin/pybal [16:09:39] Coren, YuviPanda: two icinga alerts for labstore with "UNKNOWN - Unit has no usable last run information (not a timer?) " [16:09:59] RECOVERY - pybal on lvs2001 is OK: PROCS OK: 1 process with UID = 0 (root), args /usr/sbin/pybal [16:11:59] RECOVERY - puppet last run on bast4001 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [16:12:19] operations, Wikimedia-General-or-Unknown: Page with no revisions: https://mg.wiktionary.org/wiki/franciu - https://phabricator.wikimedia.org/T112282#1630902 (matmarex) NEW [16:13:20] operations, netops: download.wikimedia.org is slow from Telecom Italia - https://phabricator.wikimedia.org/T112190#1630918 (jcrespo) @Vituzzu Yesterday/today in the morning there was network saturation on the dumps host, I think due to nginx being saturated by requests/bots. This calmed later today. Do yo...
[16:15:36] !log mw1031 rebooting for f/w update [16:15:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:16:05] operations: mw1017 has outdated broken mwscript - https://phabricator.wikimedia.org/T112174#1630937 (jcrespo) @Krenair I do not fully understand what is the suggested actionable, should the script be removed (because it is unppupetized and not in use) or updated (and/or puppetized) ? [16:18:00] Ops-Access-Requests, operations: Requesting access to stat1003, stat1002 and bast1001 for JMinor - https://phabricator.wikimedia.org/T111872#1630941 (jcrespo) a:jcrespo [16:18:08] (PS2) Andrew Bogott: Added puppetmaster test for catchpoint. [puppet] - https://gerrit.wikimedia.org/r/235632 (https://phabricator.wikimedia.org/T107456) [16:18:14] operations: mw1017 has outdated broken mwscript - https://phabricator.wikimedia.org/T112174#1630945 (Krenair) I'd prefer it to be updated by adding scap::scripts to the host directly. mwscript can occasionally be useful there (it's the host we use to test code without affecting users - runs test.wikipedia.org... [16:18:59] PROBLEM - Host mw1031 is DOWN: PING CRITICAL - Packet loss = 100% [16:19:23] (PS3) Andrew Bogott: Added puppetmaster test for catchpoint. [puppet] - https://gerrit.wikimedia.org/r/235632 (https://phabricator.wikimedia.org/T107456) [16:20:07] Blocked-on-Operations, operations, Phabricator, Traffic: Phabricator needs to expose ssh and notification daemon (websocket) - https://phabricator.wikimedia.org/T100519#1630955 (mmodell) Can anyone comment on how websockets fit into the current plan? Does that need to get broken out into a separate...
[16:22:55] (PS1) Faidon Liambotis: Remove cr1-esams/cr2-knams address from mgmt1-esams [dns] - https://gerrit.wikimedia.org/r/237695 [16:23:51] (PS2) Faidon Liambotis: Remove cr1-esams/cr2-knams address from mgmt1-esams [dns] - https://gerrit.wikimedia.org/r/237695 [16:23:59] (CR) Faidon Liambotis: [C: 2] Remove cr1-esams/cr2-knams address from mgmt1-esams [dns] - https://gerrit.wikimedia.org/r/237695 (owner: Faidon Liambotis) [16:24:19] RECOVERY - Host mw1031 is UP: PING OK - Packet loss = 0%, RTA = 2.05 ms [16:25:59] Coren: can you take a look at the alerts paravoid pointed out? [16:26:04] * YuviPanda is going to have breakfast [16:27:02] operations, Fundraising Tech Backlog: Document FR-Tech hosts on wikitech - https://phabricator.wikimedia.org/T112278#1630972 (greg) [16:28:04] operations, ops-eqiad: mw1031 has a bad uplink - https://phabricator.wikimedia.org/T95896#1630984 (Cmjohnson) The f/w update I had is for BIOS not NIC. On the plus side bios has been updated. but speed is still borked. [16:28:19] operations, Wikimedia-Language-setup, Wikimedia-Site-Requests, Patch-For-Review, user-notice: Rename "be-x-old" to "be-tarask" - https://phabricator.wikimedia.org/T11823#1630987 (Amire80) Another one that is probably related: T112285. [16:28:31] (PS1) Faidon Liambotis: Remove mr1-esams second IP on mgmt1-esams [dns] - https://gerrit.wikimedia.org/r/237696 [16:28:37] operations, Datasets-General-or-Unknown, HHVM: Convert snapshot hosts to use HHVM and trusty - https://phabricator.wikimedia.org/T94277#1630990 (jcrespo) a:ArielGlenn Ariel generously accepted to remind the rest of ops about this by owning this. [16:28:58] Blocked-on-Operations, operations, Datasets-General-or-Unknown: Snapshot hosts need to be manually added to dataset1001's exports - https://phabricator.wikimedia.org/T111586#1630993 (jcrespo) a:ArielGlenn Ariel generously accepted to remind the rest of ops about this by owning this.
[16:29:07] (CR) Faidon Liambotis: [C: 2] Remove mr1-esams second IP on mgmt1-esams [dns] - https://gerrit.wikimedia.org/r/237696 (owner: Faidon Liambotis) [16:29:15] operations, Wikimedia-Git-or-Gerrit: Wikimedia Gerrit doesn't work if OpenSSH version is higher than 7.0 - https://phabricator.wikimedia.org/T112025#1631004 (greg) @demon: I know I know, but I'd appreciate your take on this :) [16:30:24] YuviPanda: Just back from lunch. Will look at 'em [16:30:26] Krinkle: your change got in the middle of my Echo backports, do you want me to sync it out? [16:30:34] legoktm: Sure [16:31:33] (PS2) BBlack: Don't try to enforce secure POSTs on beta [puppet] - https://gerrit.wikimedia.org/r/237523 (https://phabricator.wikimedia.org/T112195) (owner: Alex Monk) [16:31:55] operations, Architecture, Incident-20150423-Commons, MediaWiki-RfCs, and 4 others: RFC: Request timeouts and retries - https://phabricator.wikimedia.org/T97204#1631022 (jcrespo) [16:31:58] operations, Beta-Cluster, Traffic, Patch-For-Review: Beta giving Error: 403, Insecure POST Forbidden - https://phabricator.wikimedia.org/T112195#1631023 (BBlack) Actually, all the other ways to factor this seem uglier.
Merging Alex's instead :) [16:32:02] (CR) BBlack: [C: 2 V: 2] Don't try to enforce secure POSTs on beta [puppet] - https://gerrit.wikimedia.org/r/237523 (https://phabricator.wikimedia.org/T112195) (owner: Alex Monk) [16:32:45] (PS1) Hashar: Continue image refresh if /etc/nodepool exists [debs/nodepool] (patch-queue/debian) - https://gerrit.wikimedia.org/r/237699 [16:32:45] !log legoktm@tin Synchronized php-1.26wmf22/resources/src/mediawiki/mediawiki.js: resourceloader: Document internal mw.loader#jobs property (duration: 01m 07s) [16:32:47] !log powercycling mw1156, multiple kernel backtraces in console output [16:32:52] (PS1) Hashar: 0.1.1-wmf4: image creation chokes on /etc/nodepool [debs/nodepool] (debian) - https://gerrit.wikimedia.org/r/237700 (https://phabricator.wikimedia.org/T111377) [16:32:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:32:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:33:13] !log ssh: connect to host mw1156.eqiad.wmnet port 22: Connection timed out [16:33:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:33:45] legoktm: see above [16:33:51] (CR) Hashar: [C: -2] "Not meant to be merged. Managed by gbp pq."
[debs/nodepool] (patch-queue/debian) - 10https://gerrit.wikimedia.org/r/237699 (owner: 10Hashar) [16:33:51] ah, ok [16:34:00] * legoktm waits a bit [16:34:05] 6operations, 10Datasets-General-or-Unknown: download.wikimedia.org is slow from Telecom Italia - https://phabricator.wikimedia.org/T112190#1631027 (10faidon) p:5Triage>3High [16:34:12] 6operations, 10Traffic: Switch codfw caches to tier2, begin pushing some traffic through them to test - https://phabricator.wikimedia.org/T110065#1631030 (10jcrespo) [16:34:23] (03PS1) 10BBlack: Beta secure POST exception: ?i and $-anchoring [puppet] - 10https://gerrit.wikimedia.org/r/237701 [16:34:48] RECOVERY - Host mw1156 is UP: PING OK - Packet loss = 0%, RTA = 1.49 ms [16:34:54] * legoktm re-syncs [16:35:03] 6operations, 10Datasets-General-or-Unknown: download.wikimedia.org is slow from Telecom Italia - https://phabricator.wikimedia.org/T112190#1628418 (10faidon) We got alerts from both Watchmouse and Catchpoint about dumps yesterday. It doesn't look networking-related but more like dumps-related. Adjusting the ta... [16:35:03] !log legoktm@tin Synchronized php-1.26wmf22/resources/src/mediawiki/mediawiki.js: resourceloader: Document internal mw.loader#jobs property (again) (duration: 00m 13s) [16:35:07] legoktm: it's back [16:35:14] yep :) [16:35:17] Krinkle: deployed [16:36:06] legoktm: Oh, the real commit is still awaiting merge [16:36:15] the other one was a dependency [16:36:18] ah [16:36:20] oh well, I'll sync again in a few [16:36:31] 6operations, 10Traffic: Switch codfw caches to tier2, begin pushing some traffic through them to test - https://phabricator.wikimedia.org/T110065#1631047 (10BBlack) 5Open>3Resolved a:3BBlack We're pushing Mexico and several US states' traffic through codfw at this point. There's a little more to do in T... 
[16:37:03] !log legoktm@tin Synchronized php-1.26wmf22/extensions/Echo/: Echo regression backports (duration: 00m 12s) [16:37:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:37:33] (CR) BBlack: [C: 2] Beta secure POST exception: ?i and $-anchoring [puppet] - https://gerrit.wikimedia.org/r/237701 (owner: BBlack) [16:37:56] YuviPanda: if you like https://gerrit.wikimedia.org/r/#/c/235632/ then I can write some more [16:39:22] (CR) BBlack: [C: 1] Raise default conntrack table size [puppet] - https://gerrit.wikimedia.org/r/237389 (https://phabricator.wikimedia.org/T105307) (owner: Muehlenhoff) [16:39:35] andrewbogott: lgtm :D merge+Test+add-catchpoint? [16:40:22] (PS4) Andrew Bogott: Added puppetmaster test for catchpoint. [puppet] - https://gerrit.wikimedia.org/r/235632 (https://phabricator.wikimedia.org/T107456) [16:40:39] operations, Traffic, Wikimedia-Apache-configuration, Patch-For-Review: wikiversity.org and wikinews.org redirects to /503.html - https://phabricator.wikimedia.org/T109226#1631075 (jcrespo) a:BBlack BBlack: assigning it nominally to you, even if it is (and should be) a group effort just to show that... [16:42:09] (CR) Andrew Bogott: [C: 2] Added puppetmaster test for catchpoint. [puppet] - https://gerrit.wikimedia.org/r/235632 (https://phabricator.wikimedia.org/T107456) (owner: Andrew Bogott) [16:42:57] andrewbogott, didn't you just do T112100 for hashar on IRC, or is it a separate issue?
[16:43:11] !log krinkle@tin Synchronized php-1.26wmf22/resources/src/mediawiki/mediawiki.js: T112232 (duration: 00m 12s) [16:43:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:43:37] 7Blocked-on-Operations, 6operations, 5Continuous-Integration-Scaling: Upload nodepool_0.1.1-wmf3 to apt.wikimedia.org and upgrade package on labnodepool1001.eqiad.wmnet - https://phabricator.wikimedia.org/T112100#1631088 (10Andrew) 5Open>3Resolved a:3Andrew done [16:43:39] jynus: I think I did [16:44:37] YuviPanda: the maps and others backup timers are not active - did you turn them off in the past? [16:45:35] s/are/were/ [16:45:46] I was actually asking only to close it myself, thank you! [16:46:29] RECOVERY - Last backup of the maps filesystem on labstore1002 is OK: OK - Last run successful [16:49:49] PROBLEM - Disk space on labstore1002 is CRITICAL: DISK CRITICAL - /run/lock/storage-replicate-labstore-others/snapshot is not accessible: Permission denied [16:50:15] 6operations, 10Beta-Cluster, 10Traffic, 5Patch-For-Review: Beta giving Error: 403, Insecure POST Forbidden - https://phabricator.wikimedia.org/T112195#1631138 (10Krenair) 5Open>3Resolved a:3Krenair [16:52:18] Coren: I had during the NFS outage but I think I re-enabled them [16:52:24] but maybe I didn't >_> [16:52:32] Afaict, only replicate-tools was active. [16:52:52] Coren: I think I assumed puppet would re-enable them [16:52:53] Coren: https://phabricator.wikimedia.org/T111031 [16:54:53] 6operations, 10ops-codfw, 10netops: cr1-eqdfw PEM 0 failure - https://phabricator.wikimedia.org/T110435#1631151 (10Papaul) The case number for this issue is 2015-0911-0451. 
Thanks [16:57:58] operations, Wikimedia-DNS, Wikimedia-Video, Patch-For-Review: Please set up a CNAME for videoserver.wikimedia.org to Video Editing Server - https://phabricator.wikimedia.org/T99216#1631175 (jcrespo) Open>stalled [16:58:08] operations, Wikimedia-DNS, Wikimedia-Video: Please set up a CNAME for videoserver.wikimedia.org to Video Editing Server - https://phabricator.wikimedia.org/T99216#1287289 (jcrespo) [16:59:35] operations, Datasets-General-or-Unknown: download.wikimedia.org is slow from Telecom Italia - https://phabricator.wikimedia.org/T112190#1631209 (Vituzzu) @jcrespo it took about 6:30 hours for a 2.6gb dump (roughly 120kb/s avg speed then) which is significantly slower than usual but definitely faster than... [17:06:48] (PS1) Jcrespo: Add scap scripts to all canary app servers [puppet] - https://gerrit.wikimedia.org/r/237707 (https://phabricator.wikimedia.org/T112174) [17:07:12] (CR) Jcrespo: [C: -1] "WIP" [puppet] - https://gerrit.wikimedia.org/r/237707 (https://phabricator.wikimedia.org/T112174) (owner: Jcrespo) [17:07:44] operations, ops-codfw: Humidity Alarms from codfw - https://phabricator.wikimedia.org/T110421#1631265 (fgiunchedi) interestingly since we started monitoring those in june there's been a 30% spike in some, not all (e.g. I sampled two in row D, one had it one didn't) either real or the sensors might have be...
[17:07:45] (03CR) 10jenkins-bot: [V: 04-1] Add scap scripts to all canary app servers [puppet] - 10https://gerrit.wikimedia.org/r/237707 (https://phabricator.wikimedia.org/T112174) (owner: 10Jcrespo) [17:08:50] 6operations, 5Patch-For-Review: mw1017 has outdated broken mwscript - https://phabricator.wikimedia.org/T112174#1631286 (10jcrespo) p:5Triage>3Normal [17:09:07] 6operations, 5Patch-For-Review: mw1017 has outdated broken mwscript - https://phabricator.wikimedia.org/T112174#1631291 (10jcrespo) a:3jcrespo [17:11:51] (03PS2) 10Jcrespo: Add scap scripts to all canary app servers [puppet] - 10https://gerrit.wikimedia.org/r/237707 (https://phabricator.wikimedia.org/T112174) [17:12:23] (03CR) 10Jcrespo: [C: 04-1] Add scap scripts to all canary app servers [puppet] - 10https://gerrit.wikimedia.org/r/237707 (https://phabricator.wikimedia.org/T112174) (owner: 10Jcrespo) [17:12:25] 6operations, 10Fundraising Tech Backlog: Document FR-Tech hosts on wikitech - https://phabricator.wikimedia.org/T112278#1631305 (10Jgreen) 5Open>3Resolved Fundraising system/service documentation is on collab wiki because some of it is sensitive. I added a table with host information here: https://collab.w... [17:12:59] (03CR) 10Alex Monk: "I don't actually know what the other 'canary' app servers do." [puppet] - 10https://gerrit.wikimedia.org/r/237707 (https://phabricator.wikimedia.org/T112174) (owner: 10Jcrespo) [17:13:39] (03CR) 10Jcrespo: "lol" [puppet] - 10https://gerrit.wikimedia.org/r/237707 (https://phabricator.wikimedia.org/T112174) (owner: 10Jcrespo) [17:15:33] (03CR) 10Jcrespo: "Let me block it until someone more in the know can say if it would be useful. We can change it to mw1017 only if that is better. 
I any cas" [puppet] - 10https://gerrit.wikimedia.org/r/237707 (https://phabricator.wikimedia.org/T112174) (owner: 10Jcrespo) [17:16:10] PROBLEM - Outgoing network saturation on labstore1003 is CRITICAL: CRITICAL: 14.29% of data above the critical threshold [100000000.0] [17:16:46] (03CR) 10Alex Monk: [C: 031] Add $wgMassMessageWikiAliases configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/237687 (owner: 10Legoktm) [17:18:37] (03CR) 10Alex Monk: "Wait, what? Why do deployment networks need to be changed just to add mwscript?" [puppet] - 10https://gerrit.wikimedia.org/r/237707 (https://phabricator.wikimedia.org/T112174) (owner: 10Jcrespo) [17:18:53] (03CR) 10Filippo Giunchedi: [C: 04-1] "a few more to add/change, LGTM otherwise" (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/225292 (owner: 10Nemo bis) [17:23:43] 6operations, 10Wikimedia-General-or-Unknown: Page with no revisions: https://mg.wiktionary.org/wiki/franciu - https://phabricator.wikimedia.org/T112282#1631369 (10Krenair) There are other pages with this issue [17:24:59] 6operations, 10ops-codfw: ms-be2006.codfw.wmnet: slot=6 dev=sdg failed - https://phabricator.wikimedia.org/T112242#1631383 (10fgiunchedi) ah yeah, thanks @jcrespo a failed grep! also reported by icinga with an LD offline ``` Sep 11 17:12:46 ms-be2006 kernel: [2795502.146260] XFS (sdg1): xfs_log_force: error 5... [17:25:25] (03Abandoned) 10MaxSem: Package jetty-runner [debs/jetty-runner] - 10https://gerrit.wikimedia.org/r/204489 (owner: 10MaxSem) [17:26:01] 6operations, 10hardware-requests: eqiad: (1) hardware request for ElasticSearch replication to Labs - 4 weeks use - https://phabricator.wikimedia.org/T112163#1631394 (10RobH) [17:27:56] 6operations, 10Wikimedia-Git-or-Gerrit: Wikimedia Gerrit doesn't work if OpenSSH version is higher than 7.0 - https://phabricator.wikimedia.org/T112025#1631412 (10QChris) >>! 
In T112025#1623154, @greg wrote: > Based on IRC discussions, this seems to only effect users of the latest OSX, correct? Any other users... [17:28:51] MatmaRex, https://mg.wiktionary.org/wiki/Manokana:Statistika [17:29:20] what about it? [17:29:23] 6operations, 10Wikimedia-Git-or-Gerrit: Wikimedia Gerrit doesn't work if OpenSSH version is higher than 7.0 - https://phabricator.wikimedia.org/T112025#1631430 (10Paladox) git 2.x on windows too all windows that support the new git update are affected. [17:30:31] 23 million revisions with almost 4 million content pages, with only 4k registered users and 30k non-content pages? [17:31:12] 6operations, 10Wikimedia-Git-or-Gerrit: Wikimedia Gerrit doesn't work if OpenSSH version is higher than 7.0 - https://phabricator.wikimedia.org/T112025#1631437 (10QChris) >>! In T112025#1623749, @hashar wrote: > I have no idea whether Gerrit comes with a built-in SSH or depends on some system .deb package. Al... [17:33:27] 6operations, 10Datasets-General-or-Unknown: download.wikimedia.org is slow from Telecom Italia - https://phabricator.wikimedia.org/T112190#1631456 (10jcrespo) @faidon sorry for being inexact on my wording (I didn't add the netops tag), with network saturation, I really meant too many requests at application le... [17:34:41] (03CR) 10Nemo bis: Add some more redis monitoring metrics to ganglia (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/225292 (owner: 10Nemo bis) [17:35:21] (03PS3) 10Nemo bis: Add some more redis monitoring metrics to ganglia [puppet] - 10https://gerrit.wikimedia.org/r/225292 [17:36:28] 6operations, 10hardware-requests: eqiad: (1) hardware request for ElasticSearch replication to Labs - 4 weeks use - https://phabricator.wikimedia.org/T112163#1631468 (10RobH) a:5RobH>3mark I'll allocate wmf4543 for this, pending Mark's approval that this is replicated in labs (just a rubberstamp that he i... 
[17:37:09] RECOVERY - Outgoing network saturation on labstore1003 is OK: OK: Less than 10.00% above the threshold [75000000.0] [17:39:05] (CR) Jcrespo: "No reason, my bad, was thinking about something else. But permissions in general should be checked, I am unsure this could create side eff" [puppet] - https://gerrit.wikimedia.org/r/237707 (https://phabricator.wikimedia.org/T112174) (owner: Jcrespo) [17:40:58] operations, ops-codfw: ms-be2006.codfw.wmnet: slot=6 dev=sdg failed - https://phabricator.wikimedia.org/T112242#1631480 (jcrespo) p:Triage>Normal a:Papaul [17:42:52] operations, Wikimedia-Git-or-Gerrit: Wikimedia Gerrit doesn't work if OpenSSH version is higher than 7.0 - https://phabricator.wikimedia.org/T112025#1631485 (devunt) [17:45:11] operations, Fundraising Tech Backlog: Document FR-Tech hosts on wikitech - https://phabricator.wikimedia.org/T112278#1631493 (greg) On that page I linked. [17:45:55] operations, Wikimedia-Git-or-Gerrit: Wikimedia Gerrit doesn't work if OpenSSH version is higher than 7.0 - https://phabricator.wikimedia.org/T112025#1631494 (Paladox) An upgrade of gerrit will benefit not just this bug but several others. We just need to pick which one 2.8.6.6 or 2.11.3. [17:46:38] operations, Discovery, Maps, Patch-For-Review: Determine limited maps deployment options - https://phabricator.wikimedia.org/T109159#1631499 (jcrespo) Open>Resolved a:jcrespo I think this was determined already by the deployed patches. Reopen if I am wrong. [17:48:24] operations, Patch-For-Review: (www.)wmfusercontent.org should respond to HTTP - https://phabricator.wikimedia.org/T104735#1631506 (jcrespo) Open>stalled I want to remember there was some disagreement on this issue.
[17:48:58] operations: (www.)wmfusercontent.org should respond to HTTP - https://phabricator.wikimedia.org/T104735#1631511 (jcrespo) [17:49:17] operations: (www.)wmfusercontent.org should respond to HTTP - https://phabricator.wikimedia.org/T104735#1631512 (jcrespo) p:Normal>Low [17:51:16] operations, Patch-For-Review: Ferm rules for app servers - https://phabricator.wikimedia.org/T104968#1631519 (jcrespo) p:Normal>High I think this should get a bump. [17:52:26] operations: Ferm rules for job runners - https://phabricator.wikimedia.org/T104972#1631524 (jcrespo) [17:52:54] operations, Commons, MediaWiki-File-management, MediaWiki-Tarball-Backports, and 6 others: InstantCommons broken by switch to HTTPS - https://phabricator.wikimedia.org/T102566#1631526 (Tgr) Bryan just merged them all, so just a new release (this probably does not count as a security issue, as `PhpH... [17:53:53] operations: Ferm rules for app servers - https://phabricator.wikimedia.org/T104968#1631530 (jcrespo) [17:54:58] operations, Traffic, HTTPS: HTTPS for internal service traffic - https://phabricator.wikimedia.org/T108580#1631535 (jcrespo) [17:56:14] Puppet, operations: more verbose hiera messages on failures - https://phabricator.wikimedia.org/T109692#1631537 (jcrespo) [17:58:22] operations, Traffic: Support ALPN + HTTP/2 - https://phabricator.wikimedia.org/T96848#1631551 (jcrespo) [18:00:17] operations, Traffic, fundraising-tech-ops, IPv6, Patch-For-Review: Enable IPv6 on donate.wikimedia.org - https://phabricator.wikimedia.org/T73267#1631563 (Jgreen) > Assigning an IP and adding it to our GeoDNS is trivial, so at this point the only blocker is for fr-tech to lift their objection. @...
[18:01:18] !log legoktm@tin Synchronized php-1.26wmf22/extensions/Echo/: Echo regression fixes #2 (duration: 00m 12s) [18:01:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:01:33] (03PS1) 10Andrew Bogott: Make copy of our instance puppet certs that tools checker can read. [puppet] - 10https://gerrit.wikimedia.org/r/237717 (https://phabricator.wikimedia.org/T107456) [18:01:40] (03CR) 10jenkins-bot: [V: 04-1] Make copy of our instance puppet certs that tools checker can read. [puppet] - 10https://gerrit.wikimedia.org/r/237717 (https://phabricator.wikimedia.org/T107456) (owner: 10Andrew Bogott) [18:01:47] 6operations, 10Wikimedia-Git-or-Gerrit: Wikimedia Gerrit doesn't work if OpenSSH version is higher than 7.0 - https://phabricator.wikimedia.org/T112025#1631574 (10Revi) User named ##RandomDSdevel## also complained about this on #wikimedia-dev. [18:01:47] 6operations, 10ops-codfw: Humidity Alarms from codfw - https://phabricator.wikimedia.org/T110421#1631575 (10RobH) a:5RobH>3Papaul I did a quick check, and the humidity is indeed a bit high. We'll need to open a ticket with cyrusone about it. I wanted to type up full instructions on how to check it, so @P... [18:02:20] (03PS2) 10Andrew Bogott: Make copy of our instance puppet certs that tools checker can read. [puppet] - 10https://gerrit.wikimedia.org/r/237717 (https://phabricator.wikimedia.org/T107456) [18:06:43] (03PS3) 10Andrew Bogott: Make copy of our instance puppet certs that tools checker can read. [puppet] - 10https://gerrit.wikimedia.org/r/237717 (https://phabricator.wikimedia.org/T107456) [18:06:48] (03CR) 10Yuvipanda: Make copy of our instance puppet certs that tools checker can read. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/237717 (https://phabricator.wikimedia.org/T107456) (owner: 10Andrew Bogott) [18:06:55] (03CR) 10Yuvipanda: [C: 04-1] Make copy of our instance puppet certs that tools checker can read. 
[puppet] - 10https://gerrit.wikimedia.org/r/237717 (https://phabricator.wikimedia.org/T107456) (owner: 10Andrew Bogott) [18:11:32] 6operations, 10ops-codfw: Humidity Alarms from codfw - https://phabricator.wikimedia.org/T110421#1631626 (10faidon) a:5Papaul>3RobH Papaul has no access to the cluster yet (cf. T111123 which you opened :)) so he'll be unable to follow your instructions. He also has no LibreNMS account. @RobH, please assist... [18:13:25] 6operations, 10Wikimedia-Git-or-Gerrit: Wikimedia Gerrit doesn't work if OpenSSH version is higher than 7.0 - https://phabricator.wikimedia.org/T112025#1631642 (10RandomDSdevel) @greg: I'm still on OS X v10.10.5 'Yosemite,' and I'm affected by this problem as well. I don't know whether it's due either to som... [18:14:29] 6operations, 10Wikimedia-Git-or-Gerrit: Wikimedia Gerrit doesn't work if OpenSSH version is higher than 7.0 - https://phabricator.wikimedia.org/T112025#1631649 (10RandomDSdevel) @Revi: Ack, ya 'ninja'd me! I've got more details for you guys, though. [18:15:16] 6operations, 10Traffic, 10fundraising-tech-ops, 7IPv6, 5Patch-For-Review: Enable IPv6 on donate.wikimedia.org - https://phabricator.wikimedia.org/T73267#1631660 (10BBlack) @Jgreen - Is the payment geography something custom in FR's software, or is it our standard GeoIP cookie stuff we use on the main sites? [18:15:27] (03PS4) 10Andrew Bogott: Make copy of our instance puppet certs that tools checker can read. [puppet] - 10https://gerrit.wikimedia.org/r/237717 (https://phabricator.wikimedia.org/T107456) [18:17:16] 6operations, 10Wikimedia-Git-or-Gerrit: Wikimedia Gerrit doesn't work if OpenSSH version is higher than 7.0 - https://phabricator.wikimedia.org/T112025#1631679 (10greg) >>! In T112025#1631437, @QChris wrote: > (So as you said further down, it once again comes down to updating Gerrit) @demon is out until monda... 
[18:17:19] 10Ops-Access-Requests, 6operations: Requesting access to bastiononly or ops for papaul - https://phabricator.wikimedia.org/T111123#1631680 (10RobH) @Dzahn: the bastiononly access allows access to the mgmt network. (That may not be a good thing but it does for now ;) So, this task still isnt clear what the out... [18:17:42] 6operations, 10Wikimedia-Git-or-Gerrit: Wikimedia Gerrit doesn't work if OpenSSH version is higher than 7.0 - https://phabricator.wikimedia.org/T112025#1631685 (10greg) p:5Normal>3High [18:19:24] (03CR) 10Tim Landscheidt: Make copy of our instance puppet certs that tools checker can read. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/237717 (https://phabricator.wikimedia.org/T107456) (owner: 10Andrew Bogott) [18:19:38] 10Ops-Access-Requests, 6operations: Requesting access to bastiononly or ops for papaul - https://phabricator.wikimedia.org/T111123#1631693 (10RobH) If this does need to wait for Mark, that is fine as well. I'm just following up as I was asked to via T110421 [18:20:01] 6operations, 10Traffic, 10fundraising-tech-ops, 7IPv6, 5Patch-For-Review: Enable IPv6 on donate.wikimedia.org - https://phabricator.wikimedia.org/T73267#1631695 (10Jgreen) It's the standard stuff. [18:22:19] (03PS5) 10Andrew Bogott: Make copy of our instance puppet certs that tools checker can read. [puppet] - 10https://gerrit.wikimedia.org/r/237717 (https://phabricator.wikimedia.org/T107456) [18:27:57] (03CR) 10Andrew Bogott: [C: 032] Make copy of our instance puppet certs that tools checker can read. [puppet] - 10https://gerrit.wikimedia.org/r/237717 (https://phabricator.wikimedia.org/T107456) (owner: 10Andrew Bogott) [18:34:02] (03PS1) 10Andrew Bogott: Fix up a few more python issues with toolschecker.py. [puppet] - 10https://gerrit.wikimedia.org/r/237720 [18:36:09] (03CR) 10Andrew Bogott: [C: 032] Fix up a few more python issues with toolschecker.py. 
[puppet] - 10https://gerrit.wikimedia.org/r/237720 (owner: 10Andrew Bogott) [18:37:23] (03PS1) 10Hashar: nodepool: send metrics to statsd [puppet] - 10https://gerrit.wikimedia.org/r/237721 [18:38:17] (03CR) 10Hashar: [C: 04-1] "Nodepool is too verbose. We need to limit the stats it sends T111504" [puppet] - 10https://gerrit.wikimedia.org/r/237721 (owner: 10Hashar) [18:43:09] 6operations, 10Traffic, 10fundraising-tech-ops, 7IPv6, 5Patch-For-Review: Enable IPv6 on donate.wikimedia.org - https://phabricator.wikimedia.org/T73267#1631798 (10BBlack) >>! In T73267#1631563, @Jgreen wrote: >> Assigning an IP and adding it to our GeoDNS is trivial, so at this point the only blocker is... [18:46:40] 6operations, 10Traffic, 10fundraising-tech-ops, 7IPv6, 5Patch-For-Review: Enable IPv6 on donate.wikimedia.org - https://phabricator.wikimedia.org/T73267#1631812 (10CCogdill_WMF) @BBlack thanks for the heads up, I think that's fine if fr-tech is OK with it. @MBeat33 should be in the loop since he manages... [19:08:54] 6operations, 10Wikimedia-General-or-Unknown: Page with no revisions: https://mg.wiktionary.org/wiki/franciu - https://phabricator.wikimedia.org/T112282#1631885 (10Krenair) ```mysql> select page_namespace, page_title from page left join revision on (rev_page = page_id) where rev_page is null; +----------------+... [19:10:33] 6operations: (www.)wmfusercontent.org should respond to HTTP - https://phabricator.wikimedia.org/T104735#1631898 (10Krinkle) >>! In T104735#1631506, @jcrespo wrote: > I want to remember there was some disagreement on this issue. What disagreement? [19:13:48] 6operations: (www.)wmfusercontent.org should respond to HTTP - https://phabricator.wikimedia.org/T104735#1631918 (10jcrespo) That having an HTTP page was a good solution or even a desired one (because the lack of TLS). I didn't participate on the discussion, though, not remember it very well, but I wanted to ref... 
[19:13:57] 6operations, 10Wikimedia-General-or-Unknown: Page with no revisions: https://mg.wiktionary.org/wiki/franciu - https://phabricator.wikimedia.org/T112282#1631919 (10Krenair) ```mysql> select page_title, page_id, page_touched from page where page_namespace = 0 and page_title in ('ficțiune', 'flacără', 'franciu',... [19:14:58] 6operations, 10Wikimedia-General-or-Unknown: Multiple pages with no revisions - https://phabricator.wikimedia.org/T112282#1631920 (10Krenair) [19:23:12] 6operations, 7Varnish: Configure varnish to use "Unconfigured domain" page for 404 Not Served (instead of generic error) - https://phabricator.wikimedia.org/T112316#1631953 (10Krinkle) 3NEW [19:26:05] (03CR) 10Dzahn: [C: 031] "lgtm, confirmed the default sizes, comment about slabinfo showing 312 bytes in size, ack" [puppet] - 10https://gerrit.wikimedia.org/r/237389 (https://phabricator.wikimedia.org/T105307) (owner: 10Muehlenhoff) [19:30:23] YuviPanda: thoughts about https://gerrit.wikimedia.org/r/#/c/227466/ ? [19:31:06] mutante: I have no opinions :) [19:32:18] it seemed simple but the comments from chase ... [19:32:40] they are asking for you now :) [19:33:14] I continue to insist I have no opinions. [19:33:24] just because something is on labs doesn't mean I have to have an opinion on it :) [19:33:30] ok, me neither, i dont understand the " cluttering the normal production role namespace " [19:34:55] andrewbogott: http://tools-checker.wmflabs.org/labs-puppetmaster/eqiad seems ok [19:35:01] heh, it wasn't my idea to ask, it's just in response to a ticket comment that mentions you .. /me moves on [19:35:25] andrewbogott: and catchpoint seems ok [19:35:39] YuviPanda: great! I’ll write a couple more before I punch out for the day [19:35:44] andrewbogott: cool! [19:35:59] andrewbogott: for public DNS I think catchpoint has a 'native' DNS check [19:36:52] YuviPanda: seems like the existing tools.wmflabs.org check already effectively checks that... 
[19:37:09] andrewbogott: hmm yeah, that's true. [19:37:17] andrewbogott: and so do *all* the checks actually [19:37:17] Probably not worth paying for a second test. [19:37:19] since tools-checker [19:37:21] andrewbogott: I agree [19:39:07] 6operations, 10Wikimedia-General-or-Unknown: Multiple pages with no revisions - https://phabricator.wikimedia.org/T112282#1632049 (10jcrespo) franciu records says page was last touched at 2015-09-11 16:08:11. dbstore1001, that runs with a 24 hour delay doesn't have a revision for that page, and the results of... [19:39:27] ^Good news [19:40:38] 6operations: (www.)wmfusercontent.org should respond to HTTP - https://phabricator.wikimedia.org/T104735#1632053 (10BBlack) There's some confusion here due to the use of "HTTP". This issue isn't about protocol (HTTP vs HTTPS). It's just about whether, if a user were to browse to `https://www.wmfusercontent.org... [19:40:56] 10Ops-Access-Requests, 6operations: Requesting access to bastiononly or ops for papaul - https://phabricator.wikimedia.org/T111123#1632057 (10Dzahn) >>! In T111123#1631680, @RobH wrote: > @Dzahn: the bastiononly access allows access to the mgmt network. (That may not be a good thing but it does for now ;) Yea... [19:41:29] 10Ops-Access-Requests, 6operations: Requesting access to bastiononly or ops for papaul - https://phabricator.wikimedia.org/T111123#1632058 (10RobH) Oh, he has that from setting it initially on every system. [19:42:12] 6operations, 10Traffic, 10fundraising-tech-ops, 7IPv6, 5Patch-For-Review: Enable IPv6 on donate.wikimedia.org - https://phabricator.wikimedia.org/T73267#1632059 (10BBlack) We can also just wait to do this until after you're done with that email. 
[19:42:19] Ops-Access-Requests, operations: Requesting access to bastiononly or ops for papaul - https://phabricator.wikimedia.org/T111123#1632060 (Dzahn) This makes sense that he is the one setting it :) [19:42:27] ^this situation is borderline ridiculous (shame on us) [19:43:46] operations, Traffic, fundraising-tech-ops, IPv6, Patch-For-Review: Enable IPv6 on donate.wikimedia.org - https://phabricator.wikimedia.org/T73267#1632072 (CCogdill_WMF) @BBlack emails are most productive in their first 24 hours, so I wouldn't mind waiting until Tuesday, 9/15 if that works for you. [19:45:15] (PS1) Andrew Bogott: Added a check for labs internal dns [puppet] - https://gerrit.wikimedia.org/r/237735 (https://phabricator.wikimedia.org/T107453) [19:46:08] PROBLEM - puppet last run on mc2016 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago [19:48:49] operations, Wikimedia-General-or-Unknown: Multiple pages with no revisions - https://phabricator.wikimedia.org/T112282#1632092 (Krenair) >>! In T112282#1632049, @jcrespo wrote: > We do no have backups from 2012, although maybe someone has a dump... {T26675} is also waiting for potential restoration from... [19:49:10] operations, Wikimedia-Git-or-Gerrit: Wikimedia Gerrit doesn't work if OpenSSH version is higher than 7.0 - https://phabricator.wikimedia.org/T112025#1632096 (Dzahn) >>! In T112025#1631642, @RandomDSdevel wrote: > I don't know whether it's due either to something that's changed server-side or the fact th... [19:49:28] YuviPanda: the same argument applies to https://phabricator.wikimedia.org/T107450 doesn’t it? [19:49:48] andrewbogott: nope, tools-checker isn't using a labs proxy [19:49:53] andrewbogott: neither is tools.wmflabs.org [19:49:55] ah, ok [19:54:00] operations, Traffic, fundraising-tech-ops, IPv6, Patch-For-Review: Enable IPv6 on donate.wikimedia.org - https://phabricator.wikimedia.org/T73267#1632124 (BBlack) Ok sounds good.
[19:54:28] operations, Wikimedia-Git-or-Gerrit: Wikimedia Gerrit doesn't work if OpenSSH version is higher than 7.0 - https://phabricator.wikimedia.org/T112025#1632126 (hashar) [20:02:03] operations, Wikimedia-Git-or-Gerrit: Wikimedia Gerrit doesn't work if OpenSSH version is higher than 7.0 - https://phabricator.wikimedia.org/T112025#1632180 (RandomDSdevel) @Dzahn: Got 'ya, will do. I'll be keeping an eye on this task, though, so I can remove this hack when things are updated on the ser... [20:02:23] operations, Wikimedia-General-or-Unknown: Multiple pages with no revisions - https://phabricator.wikimedia.org/T112282#1632181 (jcrespo) I checked the logs and a cache invalidation by @Malafaya created the page_touched, I suppose, thinking that no page content was a caching issue, and not the underlying m... [20:03:06] operations, Datasets-General-or-Unknown: download.wikimedia.org is slow from Telecom Italia - https://phabricator.wikimedia.org/T112190#1632199 (Dzahn) Then it seems more like a duplicate of T45647 ... [20:05:56] operations, ops-eqiad: mw1031 has a bad uplink - https://phabricator.wikimedia.org/T95896#1632207 (Dzahn) Or could be config of the switch port this is on?
[20:07:07] 6operations, 10Wikimedia-General-or-Unknown, 7Database: Multiple pages with no revisions - https://phabricator.wikimedia.org/T112282#1632209 (10jcrespo) [20:12:11] (03CR) 10Tim Landscheidt: [C: 04-1] [WIP DO NOT MERGE] toollabs: replace package{} by require_package() (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/236616 (owner: 10Merlijn van Deen) [20:13:22] (03CR) 10Dzahn: [C: 031] "i would not be too worried about side effects since this just adds scripts to /usr/local/bin that humans would have to use and we already " [puppet] - 10https://gerrit.wikimedia.org/r/237707 (https://phabricator.wikimedia.org/T112174) (owner: 10Jcrespo) [20:17:42] 6operations, 10Datasets-General-or-Unknown: download.wikimedia.org is slow from Telecom Italia - https://phabricator.wikimedia.org/T112190#1632240 (10jcrespo) [20:17:44] 6operations, 10Datasets-General-or-Unknown: dumps.wikimedia.org seems super-slow right now - https://phabricator.wikimedia.org/T45647#1632241 (10jcrespo) [20:20:47] 6operations, 10Datasets-General-or-Unknown: dumps.wikimedia.org seems super-slow right now - https://phabricator.wikimedia.org/T45647#1632259 (10jcrespo) p:5Normal>3High @ArielGlenn Please look at the new report at T112190, which I think has the same root issues. Unless something has changed on the thrott... [20:22:20] 6operations, 10Datasets-General-or-Unknown: At peak usage, dumps.wikimedia.org becomes very slow for users (sometimes unresponsive) - https://phabricator.wikimedia.org/T45647#1632268 (10jcrespo) [20:28:45] 6operations, 10Wikimedia-Git-or-Gerrit: Wikimedia Gerrit doesn't work if OpenSSH version is higher than 7.0 - https://phabricator.wikimedia.org/T112025#1632289 (10demon) >>! In T112025#1629114, @Dzahn wrote: > Does it really have to be a Gerrit upgrade and can't be a change to a puppetized config file as we wo... 
[20:32:40] 6operations, 10Wikimedia-Git-or-Gerrit: Wikimedia Gerrit doesn't work if OpenSSH version is higher than 7.0 - https://phabricator.wikimedia.org/T112025#1632313 (10Paladox) 2.11 fixes so much bugs and includes own built in editor would make work load so much easier if we upgrade to that one. And as we are going... [20:36:46] (03CR) 10Alex Monk: "From the task: "it's the host we use to test code without affecting users - runs test.wikipedia.org and other requests with the X-Wikimedi" [puppet] - 10https://gerrit.wikimedia.org/r/237707 (https://phabricator.wikimedia.org/T112174) (owner: 10Jcrespo) [20:36:52] 6operations, 10Wikimedia-Mailing-lists, 5Patch-For-Review: rsync the diff since mail was held on sodium - https://phabricator.wikimedia.org/T110138#1632334 (10Dzahn) since rsync scripts have been changed and use other options, i ran new tests: initial run as dry-run: ``` sent 2038665 bytes received 6328... [20:36:59] 6operations, 10Datasets-General-or-Unknown: At peak usage, dumps.wikimedia.org becomes very slow for users (sometimes unresponsive) - https://phabricator.wikimedia.org/T45647#1632336 (10BBlack) Note the tech press recently ran some articles (e.g. https://thestack.com/cloud/2015/09/09/wikipedia-anne-hathaway-op... [20:38:26] (03CR) 10Dzahn: "i guess my question is why the canary appserver needs the scripts on the host itself while the real appservers don't need them and we just" [puppet] - 10https://gerrit.wikimedia.org/r/237707 (https://phabricator.wikimedia.org/T112174) (owner: 10Jcrespo) [20:41:59] (03CR) 10Alex Monk: "We do not just run scripts on terbium. 
I was fiddling around with code on mw1017 recently and was surprised to find that mwscript did not " [puppet] - https://gerrit.wikimedia.org/r/237707 (https://phabricator.wikimedia.org/T112174) (owner: Jcrespo)
[20:55:38] (CR) Aude: "While it's rare that I need to run a script on mw1017, I recently needed to do some profiling of a maintenance script (T109088), and neede" [puppet] - https://gerrit.wikimedia.org/r/237707 (https://phabricator.wikimedia.org/T112174) (owner: Jcrespo)
[20:59:36] operations, CirrusSearch, Discovery, Discovery-Cirrus-Sprint, WorkType-Maintenance: Upgrade beta to Elasticsearch 1.7.0 - https://phabricator.wikimedia.org/T106164#1632470 (ksmith)
[21:10:54] Blocked-on-Operations, operations, Continuous-Integration-Scaling: Upload nodepool_0.1.1-wmf3 to apt.wikimedia.org and upgrade package on labnodepool1001.eqiad.wmnet - https://phabricator.wikimedia.org/T112100#1625406 (hashar) All upgraded properly. Thank you!
[21:12:03] operations, fundraising-tech-ops: package udp-filter for Trusty, for use on fundraising banner_logger - https://phabricator.wikimedia.org/T110592#1632526 (Jgreen) Ok these hacks make it compile on with automake 1.14 1) modify configure.ac: -AM_INIT_AUTOMAKE +AM_INIT_AUTOMAKE([subdir-objects foreign]) 2...
[21:12:47] operations, Continuous-Integration-Scaling: install/deploy labnodepool1001 - https://phabricator.wikimedia.org/T95045#1632533 (hashar) The service is implemented and managed to magically boot and delete an instance. The labs work made by @andrew in spring has been a huge benefit. There is still lot of w...
[21:12:57] operations, Continuous-Integration-Scaling: install/deploy labnodepool1001 - https://phabricator.wikimedia.org/T95045#1632536 (hashar) stalled>Resolved a:hashar
[21:15:14] operations: (www.)wmfusercontent.org should respond to HTTP - https://phabricator.wikimedia.org/T104735#1632564 (Krinkle) https://phab.wmfusercontent.org is served with a wildcard certificate. So we do have one. And phab.wmfusercontent.org is served from varnish/misc so presumably the certificates is already...
[21:17:55] hey mutante, do you know about url-downloader?
[21:19:36] operations, Wikimedia-General-or-Unknown, Database: Multiple pages with no revisions - https://phabricator.wikimedia.org/T112282#1632579 (Malafaya) @jcrespo, indeed I did an //?action=purge// to see if it would solve it, without success.
[21:20:17] Krenair: i know that it's a http proxy ...
[21:20:38] do you know why it might return 403 to certain requests and not others?
[21:20:54] operations: (www.)wmfusercontent.org should respond to HTTP - https://phabricator.wikimedia.org/T104735#1632583 (BBlack) >>! In T104735#1632564, @Krinkle wrote: > https://phab.wmfusercontent.org is served with a wildcard certificate. So we do have one. And phab.wmfusercontent.org is served from varnish/misc s...
[21:21:33] !log shutdown nodepool on labnodepool1001.eqiad.wmnet until monday
[21:21:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[21:21:41] Krenair: no, unless the difference is the protocol (you have to set it separately for https )
[21:21:56] no
[21:23:06] can you see any pattern in the ones you get 403 for ? i use that for example to download planet feeds too and haven't noticed an issue
[21:24:34] (PS1) QChris: Make gerrit offer newer key exchange algorithms for new sshs [puppet] - https://gerrit.wikimedia.org/r/237753 (https://phabricator.wikimedia.org/T112025)
[21:24:58] (CR) QChris: [C: -1] "Uploading a jar to puppet is wrong, I know.
Hence, CR-1." [puppet] - https://gerrit.wikimedia.org/r/237753 (https://phabricator.wikimedia.org/T112025) (owner: QChris)
[21:25:53] mutante, http://www.ub.unibas.ch/digi/a100/diverse_projekte/gt1gb2load.tar
[21:26:12] Krenair: that's 45 Gigabyte :)
[21:26:22] it's quite happy to get the root of that domain though
[21:26:22] that might explain
[21:26:33] does it not like large files?
[21:27:06] i have no proof but a limit wouldn't surprise
[21:27:54] does it log anything useful?
[21:29:02] operations, Wikimedia-Git-or-Gerrit, Patch-For-Review: Wikimedia Gerrit doesn't work if OpenSSH version is higher than 7.0 - https://phabricator.wikimedia.org/T112025#1632613 (QChris) I verified that even for our old gerrit, adding BouncyCastle is sufficient to make OpenSSH 7 happy again. Hence, we hav...
[21:29:13] (CR) Jcrespo: "Where did that jar come from? Assuming you have the source code for the jar (otherwise, it is a no-go), I would suggest creating a packag" [puppet] - https://gerrit.wikimedia.org/r/237753 (https://phabricator.wikimedia.org/T112025) (owner: QChris)
[21:32:10] (CR) Jcrespo: [C: 1] Enable ferm for role::mariadb::analytics [puppet] - https://gerrit.wikimedia.org/r/235444 (owner: Muehlenhoff)
[21:32:12] Krenair: there's an access.log
[21:32:16] Krenair: TCP_DENIED_REPLY/403 3973 GET http://www.ub.unibas.ch/digi/a100/diverse_projekte/gt1gb2load.tar
[21:32:47] so the remote site is actually the one sending the 403?
[21:32:51] -Squid-Error: ERR_TOO_BIG
[21:32:59] there, too big
[21:33:08] ah
[21:33:15] (PS1) Alex Monk: Update my .bashrc [puppet] - https://gerrit.wikimedia.org/r/237755
[21:33:16] no, it's the squid
[21:33:36] 15 maximum_object_size 1010 MB
[21:33:57] (CR) Ori.livneh: [C: 2 V: 2] Update my .bashrc [puppet] - https://gerrit.wikimedia.org/r/237755 (owner: Alex Monk)
[21:34:34] thanks ori
[21:34:50] the default value for that is 4MB :)
[21:34:53] mutante, so it actually doesn't handle 1GB files?
great...
[21:35:09] legoktm, how do you get large files to the servers?
[21:36:01] Krenair: depending on how large they are, I'll wget from terbium directly, or wget on my laptop and rsync to production
[21:36:10] dunno why 1010 but not 1024
[21:36:25] full archive is too large for terbium
[21:37:02] do it in parts? :|
[21:37:37] Krenair: bast1001 might have enough room? dunno if thats ok
[21:38:00] I don't have the parts and don't want to send it via my laptop, that'd take forever
[21:38:50] Krenair: bast1001 can talk to the internet iirc
[21:41:40] operations, Commons, MediaWiki-File-management, MediaWiki-Tarball-Backports, and 7 others: InstantCommons broken by switch to HTTPS - https://phabricator.wikimedia.org/T102566#1632670 (hashar)
[21:42:35] operations, hardware-requests, Database: new external storage cluster(s) - https://phabricator.wikimedia.org/T105843#1632673 (jcrespo)
[21:43:36] we don't really want to encourage working from the bastion, but as a stopgap i think it's ok, it has > 300G as long as you delete it again
[21:43:42] legoktm, I also checked bast2001 and hooft, but they're tiny. bast1001 looks okay. but then how would we get it to a host that can actually upload it properly?
[21:43:49] yes, I'm definitely not planning to leave the file laying around :)
[21:44:10] This would've been possible with agent forwarding, but...
[21:44:14] hmm
[21:44:51] no idea on that :P
[21:44:58] I used to scp from bast1001 -> terbium
[21:45:04] not since agent forwarding was disabled though
[21:45:58] i used rsyncd to copy files without that
[21:46:29] I know nothing about rsyncd. Is it possible to do that without being root?
[21:47:00] to use it, yes, but not to initially set it up
[21:47:07] well, it's done by puppet
[21:47:10] ugh
[21:47:23] yeah, let's not for a one-off upload request
[21:47:33] i was about to ask, that's only if this is a permanent problem
[21:48:01] Can we do something via labs?
[21:49:08] i think if you had it on labs you wouldn't be able to push it over to prod from there
[21:49:23] well, unless it's on a webserver
[21:49:28] temporary labs user, download there, then stick the private key (new key just for the labs user) on terbium? or is normal production completely unable to ssh to labs?
[21:51:46] ssh from bastion to bastion works
[21:51:57] from bast1001 to a labs bastion?
[21:52:00] yea
[21:52:17] but that still has the same issue
[21:53:15] the file needs to get to a mediawiki host (e.g. terbium, tin) for upload
[21:53:27] terbium does not have enough space for the file :p
[21:54:18] tin does
[21:54:20] yeah, which is why I was planning to do it on tin
[21:54:47] again, not ideal, but it should work
[22:03:46] mutante, legoktm: so, any ideas? I don't think bast1001 should become an rsyncd host just for server-side uploads...
[22:04:22] nope :(
[22:04:26] we need a real solution to this
[22:05:52] (CR) QChris: "The jar is available from" [puppet] - https://gerrit.wikimedia.org/r/237753 (https://phabricator.wikimedia.org/T112025) (owner: QChris)
[22:05:54] i think it deserves a ticket
[22:06:00] can we download it on labs, split it into parts, make the parts accessible over the web, download them on tin and piece it all back together and extract?
[22:06:10] one other idea was just "chunked" downloads
[22:06:15] http://stackoverflow.com/questions/12993879/download-large-file-chunk-by-chunk-with-php-curl
[22:07:54] fwiw, the rsyncd would be on the target, so tin
[22:08:12] terbium in general
[22:08:27] terbium has a disk space issue besides all the download issues
[22:08:31] it just wouldn't fit
[22:08:31] yeah, but
[22:08:58] could download it onto bast1001 and extract the files there, then rsync what fits to terbium and upload, then rm from terbium and do the next batch
[22:09:31] operations, Wikimedia-Git-or-Gerrit, Patch-For-Review: Wikimedia Gerrit doesn't work if OpenSSH version is higher than 7.0 - https://phabricator.wikimedia.org/T112025#1632777 (QChris) >>! In T112025#1632289, @demon wrote: > We can try building [...] against the newer jsch. I'd not attempt that. Mina S...
[22:09:35] it's not one huge 46GB file to upload, that wouldn't be allowed by FileBackend anyway
[22:10:06] just multiple images that exceed the 1GB web upload limit in a huge archive
[22:10:19] can we ask the creator of this 46GB file to split it up on the source server?
[22:10:36] yes, although it doesn't solve the proxy issue
[22:10:50] if they are under 1GB it does, right
[22:11:04] If they are under 1GB then they wouldn't be asking for a server-side upload.
[22:11:33] server-side upload limit is higher
[22:11:33] ok, let's find a permanent solution via a ticket with more input from others
[22:13:45] okay
[22:15:06] mutante, the uploader splitting them up would not really solve any major issues I don't think
[22:15:39] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL: CRITICAL: 9.09% of data above the critical threshold [500.0]
[22:17:30] Krenair: well, it would make it more reasonable for a deployer to download locally and rsync
[22:19:36] operations, Wikimedia-Git-or-Gerrit, Patch-For-Review: Wikimedia Gerrit doesn't work if OpenSSH version is higher than 7.0 - https://phabricator.wikimedia.org/T112025#1632825 (RandomDSdevel) Yeah, sounds like doing something like that could get //ugly!//
[22:21:10] operations, Wikimedia-Site-Requests: Please upload large files to Wikimedia Commons - https://phabricator.wikimedia.org/T111941#1632836 (Krenair) The file is too big for the proxy to handle. It doesn't handle files 1GB or larger. But I shouldn't have to proxy this file via my own laptop to get it onto a m...
[22:22:44] operations, Wikimedia-Site-Requests: Please upload large files to Wikimedia Commons - https://phabricator.wikimedia.org/T111941#1620397 (Krenair) Or can we download it on labs, split it into parts, make the parts accessible over the web, download them on tin and piece it all back together and extract?
[22:25:49] RECOVERY - HTTP 5xx req/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[22:30:45] operations, Wikimedia-Site-Requests: Please upload large files to Wikimedia Commons - https://phabricator.wikimedia.org/T111941#1632861 (Krenair)
[22:33:41] YuviPanda, any thoughts on ^ ?
[22:33:47] particularly my last comment involving labs
[22:36:09] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL: CRITICAL: 7.14% of data above the critical threshold [500.0]
[22:41:14] (CR) Gergő Tisza: "How would I compute pwd_hash_sql?"
[puppet] - https://gerrit.wikimedia.org/r/237565 (https://phabricator.wikimedia.org/T112228) (owner: Gergő Tisza)
[22:44:19] RECOVERY - HTTP 5xx req/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[22:50:06] (PS1) Ori.livneh: Grafana: allow unauthenticated GET requests [puppet] - https://gerrit.wikimedia.org/r/237761
[22:57:10] operations, Wikimedia-Site-Requests: Please upload large files to Wikimedia Commons - https://phabricator.wikimedia.org/T111941#1632903 (Dzahn) trying to break this down: issue: - we need to download really large files; then upload them on commons status: - usually terbium is used for that. that's also...
[23:00:36] AaronSchulz: https://btrfs.wiki.kernel.org/index.php/Conversion_from_Ext3
[23:03:24] operations, Wikimedia-Site-Requests: Please upload large files to Wikimedia Commons - https://phabricator.wikimedia.org/T111941#1632924 (MaxSem) Hmm, I downloaded a 29 gigs OSM dump not so long ago without problems with `curl -O -x webproxy.eqiad.wmnet:8080 `
[23:20:34] operations, Wikimedia-Site-Requests: Please upload large files to Wikimedia Commons - https://phabricator.wikimedia.org/T111941#1632985 (Krenair) webproxy.eqiad.wmnet seems promising. Downloading to tin in a screen called `T111941`.
[23:20:34] MaxSem: :) that was nice, we talked about different proxies
[23:20:37] and it's the solution
[23:20:45] one is limited the other isnt
[23:20:51] hehehe
[23:21:20] Thanks MaxSem
[23:21:32] Next ticket: figure out why we have both url-downloader and webproxy :)
[23:22:23] and why bastions allow external downloading
[23:23:00] second is pretty easy: to prevent people from going fucking insane ;)
[23:24:59] operations, Wikimedia-Site-Requests: Please upload large files to Wikimedia Commons - https://phabricator.wikimedia.org/T111941#1632995 (Dzahn) ... and the difference is `url-downloader` was used as proxy and is `maximum_object_size 1010 MB` (squid config on chromium).
while webproxy.eqiad is on carbon a...
[23:32:39] (CR) GWicke: [C: 1] "LGTM." [puppet] - https://gerrit.wikimedia.org/r/237761 (owner: Ori.livneh)
[23:35:08] (CR) Dzahn: "Thanks for the explanation. It's a +1.5 for me, -0.5 because i don't want to override jcrespo's -1 on his own patch and Friday" [puppet] - https://gerrit.wikimedia.org/r/237707 (https://phabricator.wikimedia.org/T112174) (owner: Jcrespo)
[23:40:09] (PS1) BryanDavis: Backport of D37899: Fix ReflectionClass::getMethods filter [debs/hhvm] - https://gerrit.wikimedia.org/r/237860 (https://phabricator.wikimedia.org/T95864)
[23:41:12] (PS1) BryanDavis: Backport of D44265: filter_var_array: do not fall back to FILTER_DEFAULT [debs/hhvm] - https://gerrit.wikimedia.org/r/237861 (https://phabricator.wikimedia.org/T107677)
[23:42:37] (PS1) BryanDavis: Backport of D45165: Limit log message length for unserialize failures [debs/hhvm] - https://gerrit.wikimedia.org/r/237862
[23:44:05] (Abandoned) BryanDavis: Backport of D44265: filter_var_array: do not fall back to FILTER_DEFAULT [debs/hhvm] - https://gerrit.wikimedia.org/r/237006 (https://phabricator.wikimedia.org/T107677) (owner: BryanDavis)
[23:44:57] (Abandoned) BryanDavis: Backport of D45165: Limit log message length for unserialize failures [debs/hhvm] - https://gerrit.wikimedia.org/r/237007 (owner: BryanDavis)
[23:48:05] (PS1) Tim Landscheidt: Tools: Accept mail for all submit hosts [puppet] - https://gerrit.wikimedia.org/r/237863 (https://phabricator.wikimedia.org/T63484)
[23:59:27] (CR) Tim Landscheidt: "I tested this successfully by generating /etc/exim4/exim4.conf on Toolsbeta, amending INSTANCEPROJECT, MAILDOMAIN and primary_hostname to " [puppet] - https://gerrit.wikimedia.org/r/237863 (https://phabricator.wikimedia.org/T63484) (owner: Tim Landscheidt)