[00:13:34] (Abandoned) Faidon Liambotis: Provision generic StatsD instance on professor.pmtpa.wmnet [operations/puppet] - https://gerrit.wikimedia.org/r/83603 (owner: Ori.livneh)
[00:14:12] (CR) Faidon Liambotis: [C: -1] "Why can't we use the hostname?" [operations/puppet] - https://gerrit.wikimedia.org/r/86744 (owner: Hashar)
[00:14:24] paravoid: if you're around, +2 on https://gerrit.wikimedia.org/r/#/c/87682/? :) trivial.
[00:16:22] do we have a version of nginx that supports spdy?
[00:16:37] I don't think so
[00:17:19] paravoid: on that particular labs module?
[00:17:20] paravoid: yes!
[00:17:28] (CR) Faidon Liambotis: [C: -1] "I don't mind having HTTPS for this at all -- if anything, it's more "consistent", in the sense that no HTTPS Everywhere exceptions are nee" [operations/puppet] - https://gerrit.wikimedia.org/r/84873 (owner: Akosiaris)
[00:17:29] paravoid: if you look at dynamicproxy, it uses the lua stuff in nginx
[00:17:48] paravoid: the version installed there (nginx-extras, 1.4.x) supports spdy/2 too
[00:17:56] I checked.
[00:18:09] (CR) Faidon Liambotis: [C: 2] Add SPDY support to dynamicproxy [operations/puppet] - https://gerrit.wikimedia.org/r/87682 (owner: Yuvipanda)
[00:18:13] wooo! thanks paravoid
[00:18:15] sure, what the hell
[00:18:22] heh
[00:18:22] if it breaks, it breaks :P
[00:18:37] well, if it breaks there's only cscott and analytics using it :P
[00:18:48] let me force a run
[00:20:18] paravoid: woo, it works! :) http://spdycheck.org/#pinklake.wmflabs.org
[00:20:26] ty!
[00:20:54] (CR) Faidon Liambotis: "Which are these consumers? What is our timeframe? I'd really like us to not do such hacks in our pipeline, can you imagine changing/adding" [operations/puppet] - https://gerrit.wikimedia.org/r/86894 (owner: Ottomata)
[00:41:32] (CR) MZMcBride: "Thanks for working on this." [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/87896 (owner: Ori.livneh)
[00:53:14] (PS1) Springle: depool db1022 for upgrade [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/87954
[00:53:54] (PS2) Springle: depool db1022 for upgrade [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/87954
[00:54:06] (CR) Springle: [C: 2] depool db1022 for upgrade [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/87954 (owner: Springle)
[00:59:23] !log springle synchronized wmf-config/db-eqiad.php 'depool db1022 for upgrade'
[00:59:44] Logged the message, Master
[01:15:25] PROBLEM - Puppet freshness on ms-fe1001 is CRITICAL: No successful Puppet run in the last 10 hours
[01:37:25] PROBLEM - Host mw27 is DOWN: PING CRITICAL - Packet loss = 100%
[01:38:15] RECOVERY - Host mw27 is UP: PING OK - Packet loss = 0%, RTA = 26.51 ms
[01:40:36] PROBLEM - Apache HTTP on mw27 is CRITICAL: Connection refused
[01:42:36] RECOVERY - Apache HTTP on mw27 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.257 second response time
[02:04:25] !log LocalisationUpdate failed: git pull of extensions failed
[02:04:41] Logged the message, Master
[02:10:44] :-(
[02:50:33] (PS1) Springle: db1022 to mariadb [operations/puppet] - https://gerrit.wikimedia.org/r/87993
[02:52:11] (CR) Springle: [C: 2] db1022 to mariadb [operations/puppet] - https://gerrit.wikimedia.org/r/87993 (owner: Springle)
[03:19:23] !log xtrabackup clone db43 to db1022
[03:19:39] Logged the message, Master
[04:56:44] PROBLEM - RAID on searchidx1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:57:35] RECOVERY - RAID on searchidx1001 is OK: OK: State is Optimal, checked 1 logical drive(s), 4 physical drive(s)
[05:59:28] (PS1) Springle: sub db1015 into s6 during slave upgrades [operations/puppet] - https://gerrit.wikimedia.org/r/88001
[06:00:30] (CR) Springle: [C: 2] sub db1015 into s6 during slave upgrades [operations/puppet] - https://gerrit.wikimedia.org/r/88001 (owner: Springle)
[06:04:20] RECOVERY - search indices - check lucene status page on search23 is OK: HTTP OK: HTTP/1.1 200 OK - 269 bytes in 0.055 second response time
[06:04:50] RECOVERY - search indices - check lucene status page on search22 is OK: HTTP OK: HTTP/1.1 200 OK - 157 bytes in 0.054 second response time
[06:30:50] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 30 seconds
[06:31:50] PROBLEM - DPKG on snapshot3 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:32:41] RECOVERY - DPKG on snapshot3 is OK: All packages OK
[06:33:41] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 17.515 second response time
[06:36:50] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 30 seconds
[06:37:41] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 15.408 second response time
[06:38:10] (CR) Eloquence: "Agree with previous commenters, specifically CSteipp's comments above as to the limited benefits of retaining HTTPS support. It's fairly s" [operations/puppet] - https://gerrit.wikimedia.org/r/84873 (owner: Akosiaris)
[06:44:50] PROBLEM - DPKG on snapshot3 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:48:50] RECOVERY - MySQL Slave Running on db1001 is OK: OK replication Slave_IO_Running: Yes Slave_SQL_Running: Yes Last_Error:
[06:49:50] PROBLEM - RAID on snapshot3 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:50:40] PROBLEM - MySQL Slave Delay on db35 is CRITICAL: CRIT replication delay 1598452 seconds
[06:50:50] RECOVERY - RAID on snapshot3 is OK: OK: no RAID installed
[06:51:00] PROBLEM - MySQL Slave Delay on db1016 is CRITICAL: CRIT replication delay 1597652 seconds
[06:51:30] PROBLEM - MySQL Slave Delay on db1001 is CRITICAL: CRIT replication delay 1593905 seconds
[06:51:41] RECOVERY - MySQL Slave Delay on db35 is OK: OK replication delay 55 seconds
[06:51:41] RECOVERY - DPKG on snapshot3 is OK: All packages OK
[06:53:23] (PS2) TTO: Set logo for ukwikisource per community request [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/85838
[06:54:40] PROBLEM - MySQL Slave Delay on db35 is CRITICAL: CRIT replication delay 1589895 seconds
[06:55:24] (PS2) TTO: Set up rollbacker and filemover groups on hiwiki [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/85961
[06:56:21] (PS2) TTO: Change SUL image for loginwiki to WMF logo [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/86091
[06:57:28] (PS3) TTO: Wnable $wgUseRCPatrol on fawiki [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/86093
[06:58:50] PROBLEM - DPKG on snapshot3 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:59:38] (PS2) TTO: Enable subpages in Programs namespace of metawiki [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/86414
[06:59:50] RECOVERY - DPKG on snapshot3 is OK: All packages OK
[07:01:48] (PS2) TTO: Clean up wgSiteName in InitialiseSettings [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/86418
[07:02:50] (PS2) TTO: Miscellaneous cleanup of InitialiseSettings [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/86415
[07:03:41] PROBLEM - MySQL Replication Heartbeat on db33 is CRITICAL: CRIT replication delay 302 seconds
[07:04:20] PROBLEM - MySQL Replication Heartbeat on db35 is CRITICAL: CRIT replication delay 312 seconds
[07:04:49] (PS2) TTO: Remove wgSkipSkin and wgSkipSkins [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/85645
[07:05:32] (PS3) TTO: Category collation for viwikivoyage to uca-vi [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/85419
[07:08:50] PROBLEM - DPKG on snapshot3 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:09:50] PROBLEM - RAID on snapshot3 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:11:50] PROBLEM - Disk space on snapshot3 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:11:50] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 30 seconds
[07:13:41] RECOVERY - Disk space on snapshot3 is OK: DISK OK
[07:14:50] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 25.009 second response time
[07:17:50] PROBLEM - Disk space on snapshot3 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:20:50] PROBLEM - SSH on snapshot3 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[07:21:41] RECOVERY - SSH on snapshot3 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0)
[07:24:50] PROBLEM - SSH on snapshot3 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[07:25:41] RECOVERY - SSH on snapshot3 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0)
[07:28:50] PROBLEM - SSH on snapshot3 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[07:30:41] RECOVERY - SSH on snapshot3 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0)
[07:30:50] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 30 seconds
[07:31:40] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 14.179 second response time
[07:33:50] PROBLEM - SSH on snapshot3 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[07:35:41] RECOVERY - SSH on snapshot3 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0)
[07:36:50] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 30 seconds
[07:38:50] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 27.870 second response time
[07:40:26] hello
[07:41:50] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 30 seconds
[07:45:50] RECOVERY - Disk space on snapshot3 is OK: DISK OK
[07:48:31] (CR) Hashar: "The manifests seems fine to me, but I am probably sure we will not want to use git clone from puppet to deploy the code. The code deploym" [operations/puppet] - https://gerrit.wikimedia.org/r/61767 (owner: Physikerwelt)
[07:48:41] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 18.578 second response time
[07:50:50] PROBLEM - Disk space on snapshot3 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:55:50] PROBLEM - SSH on snapshot3 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[07:56:15] (CR) Akosiaris: [C: -1] "(1 comment)" [operations/puppet] - https://gerrit.wikimedia.org/r/61767 (owner: Physikerwelt)
[07:56:49] akosiaris: morning :-]
[07:57:11] akosiaris: I think Physikerwelt is being contracted by Wikimedia Deutschland to push Latexml as a webservice
[07:57:23] not sure whether anyone from ops is formally involved in that project though
[07:57:29] hashar: good morning to you too
[07:58:38] looking at our last meeting notes provides no clue... so maybe not
[07:59:52] :-(
[08:00:50] RECOVERY - SSH on snapshot3 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0)
[08:01:50] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 30 seconds
[08:04:50] PROBLEM - SSH on snapshot3 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:07:50] RECOVERY - Disk space on snapshot3 is OK: DISK OK
[08:09:41] RECOVERY - SSH on snapshot3 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0)
[08:10:50] PROBLEM - Disk space on snapshot3 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:11:50] RECOVERY - Disk space on snapshot3 is OK: DISK OK
[08:11:50] RECOVERY - RAID on snapshot3 is OK: OK: no RAID installed
[08:11:50] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 27.731 second response time
[08:14:50] PROBLEM - Disk space on snapshot3 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:16:35] (PS1) Ori.livneh: Add Icinga check for l10nupdate & drop !log-based alerts [operations/puppet] - https://gerrit.wikimedia.org/r/88009
[08:17:50] PROBLEM - RAID on snapshot3 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:20:50] PROBLEM - SSH on snapshot3 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:20:50] RECOVERY - RAID on snapshot3 is OK: OK: no RAID installed
[08:21:40] RECOVERY - SSH on snapshot3 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0)
[08:23:50] PROBLEM - RAID on snapshot3 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:24:50] PROBLEM - SSH on snapshot3 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:25:41] RECOVERY - SSH on snapshot3 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0)
[08:28:50] PROBLEM - SSH on snapshot3 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:29:37] (CR) Akosiaris: "(1 comment)" [operations/puppet] - https://gerrit.wikimedia.org/r/87598 (owner: Akosiaris)
[08:30:10] PROBLEM - Host db1016 is DOWN: PING CRITICAL - Packet loss = 100%
[08:30:40] RECOVERY - Host db1016 is UP: PING OK - Packet loss = 0%, RTA = 0.89 ms
[08:30:45] (PS2) Akosiaris: Cleanup swift monitor_service entries [operations/puppet] - https://gerrit.wikimedia.org/r/87598
[08:37:41] RECOVERY - SSH on snapshot3 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0)
[08:37:41] RECOVERY - Disk space on snapshot3 is OK: DISK OK
[08:37:50] RECOVERY - RAID on snapshot3 is OK: OK: no RAID installed
[08:40:50] PROBLEM - Disk space on snapshot3 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:40:50] PROBLEM - RAID on snapshot3 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:44:41] RECOVERY - Disk space on snapshot3 is OK: DISK OK
[08:44:41] RECOVERY - DPKG on snapshot3 is OK: All packages OK
[08:44:41] RECOVERY - RAID on snapshot3 is OK: OK: no RAID installed
[08:46:59] (PS2) Akosiaris: move check-raid.py from base/files/monitoring/ to nrpe/plugins/ [operations/puppet] - https://gerrit.wikimedia.org/r/87538 (owner: Dzahn)
[08:47:54] (CR) jenkins-bot: [V: -1] move check-raid.py from base/files/monitoring/ to nrpe/plugins/ [operations/puppet] - https://gerrit.wikimedia.org/r/87538 (owner: Dzahn)
[08:53:57] (PS3) Akosiaris: move check-raid.py from base/files/monitoring/ to nrpe/plugins/ [operations/puppet] - https://gerrit.wikimedia.org/r/87538 (owner: Dzahn)
[08:54:49] (CR) jenkins-bot: [V: -1] move check-raid.py from base/files/monitoring/ to nrpe/plugins/ [operations/puppet] - https://gerrit.wikimedia.org/r/87538 (owner: Dzahn)
[09:05:29] seriously starting to dislike pep8....
[09:06:10] (PS4) Akosiaris: move check-raid.py from base/files/monitoring/ to nrpe/plugins/ [operations/puppet] - https://gerrit.wikimedia.org/r/87538 (owner: Dzahn)
[09:06:54] akosiaris: which text editor are you using? It can most probably be configured to report pep8 / pyflakes errors straight in your editing buffer
[09:07:18] I already have that
[09:07:34] it's just that i have pep8 1.2.1 and gallium has pep8 1.4.something
[09:07:35] http://i.imgur.com/0KJLqFp.png :D
[09:07:40] ohhh
[09:07:47] yeah 1.2.1 is quite old
[09:08:13] testing has 1.4.5
[09:08:31] our Jenkins has 1.4.6 which is in unstable
[09:08:38] (or pip install *evil*)
[09:08:56] meh... I was hoping i wouldn't have to start fetching stuff from jessie so soon ...
[09:09:32] pep8 is evolving quickly, that made the version provided by stable very obsolete =(
[09:10:10] I will probably migrate our Jenkins to use flake8 (a wrapper around pep8 and pyflakes) which landed in unstable a few days ago
[09:11:29] pyflakes ?
[09:11:36] haven't used that ever...
[09:11:48] it does code analysis
[09:12:01] for example, it will whine if you import a module and do not actually use it
[09:12:29] I use vim syntastic which automatically uses pep8 / pyflakes and mccabe whenever they are present
[09:12:38] that catches some bugs sometimes :-]
[09:34:59] (PS1) Springle: warm up db1015 [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/88025
[09:42:31] (PS1) ArielGlenn: add neverett to stats1002, rt 5886 [operations/puppet] - https://gerrit.wikimedia.org/r/88026
[09:42:46] (CR) Springle: [C: 2] warm up db1015 [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/88025 (owner: Springle)
[09:44:29] (CR) Faidon Liambotis: [C: -2] "I don't like that. check-raid also has dependencies (megacli, arcconf packages) & related resources (sudo). I'd rather prefer having them " [operations/puppet] - https://gerrit.wikimedia.org/r/87538 (owner: Dzahn)
[09:44:44] PROBLEM - MySQL Recent Restart on db1021 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:44:46] !log springle synchronized wmf-config/db-eqiad.php 'warm up db1015'
[09:45:04] Logged the message, Master
[09:45:34] RECOVERY - MySQL Recent Restart on db1021 is OK: OK 322 seconds since restart
[09:49:39] (CR) ArielGlenn: [C: 2] add neverett to stats1002, rt 5886 [operations/puppet] - https://gerrit.wikimedia.org/r/88026 (owner: ArielGlenn)
[09:54:44] PROBLEM - DPKG on db1015 is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[09:55:03] mark, hi, around?
[10:00:45] RECOVERY - DPKG on db1015 is OK: All packages OK
[10:03:42] (CR) Faidon Liambotis: [C: 2] "Heh, thanks :-)" [operations/puppet] - https://gerrit.wikimedia.org/r/87598 (owner: Akosiaris)
[10:10:20] !log powercycling srv291
[10:10:24] PROBLEM - MySQL Processlist on db1021 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[10:10:33] Logged the message, Master
[10:11:05] (CR) Faidon Liambotis: [C: -1] "(5 comments)" [operations/puppet] - https://gerrit.wikimedia.org/r/88009 (owner: Ori.livneh)
[10:13:24] RECOVERY - Host srv291 is UP: PING OK - Packet loss = 0%, RTA = 26.61 ms
[10:14:34] PROBLEM - mysqld processes on db1021 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld
[10:14:44] PROBLEM - swift-container-updater on ms-be8 is CRITICAL: NRPE: Command check_swift_container_updater not defined
[10:14:44] PROBLEM - swift-account-reaper on ms-be8 is CRITICAL: NRPE: Command check_swift_account_reaper not defined
[10:14:44] PROBLEM - swift-container-replicator on ms-be8 is CRITICAL: NRPE: Command check_swift_container_replicator not defined
[10:14:44] PROBLEM - swift-object-auditor on ms-be8 is CRITICAL: NRPE: Command check_swift_object_auditor not defined
[10:14:45] PROBLEM - swift-account-replicator on ms-be8 is CRITICAL: NRPE: Command check_swift_account_replicator not defined
[10:14:45] PROBLEM - swift-object-updater on ms-be8 is CRITICAL: NRPE: Command check_swift_object_updater not defined
[10:14:45] PROBLEM - swift-container-auditor on ms-be8 is CRITICAL: NRPE: Command check_swift_container_auditor not defined
[10:14:54] PROBLEM - swift-container-server on ms-be8 is CRITICAL: NRPE: Command check_swift_container_server not defined
[10:14:55] argh
[10:15:00] akosiaris: ^ :)
[10:15:04] PROBLEM - swift-account-auditor on ms-be8 is CRITICAL: NRPE: Command check_swift_account_auditor not defined
[10:15:04] PROBLEM - swift-object-replicator on ms-be8 is CRITICAL: NRPE: Command check_swift_object_replicator not defined
[10:15:14] PROBLEM - swift-object-server on ms-be8 is CRITICAL: NRPE: Command check_swift_object_server not defined
[10:15:24] PROBLEM - swift-account-server on ms-be8 is CRITICAL: NRPE: Command check_swift_account_server not defined
[10:15:31] depooling db1021
[10:15:35] PROBLEM - swift-object-replicator on ms-be1 is CRITICAL: NRPE: Command check_swift_object_replicator not defined
[10:15:35] PROBLEM - swift-container-replicator on ms-be1 is CRITICAL: NRPE: Command check_swift_container_replicator not defined
[10:15:35] PROBLEM - swift-object-auditor on ms-be1 is CRITICAL: NRPE: Command check_swift_object_auditor not defined
[10:15:35] PROBLEM - swift-object-server on ms-be1 is CRITICAL: NRPE: Command check_swift_object_server not defined
[10:15:35] PROBLEM - swift-account-auditor on ms-be1 is CRITICAL: NRPE: Command check_swift_account_auditor not defined
[10:15:44] PROBLEM - swift-account-server on ms-be1 is CRITICAL: NRPE: Command check_swift_account_server not defined
[10:15:45] PROBLEM - Apache HTTP on srv291 is CRITICAL: Connection refused
[10:15:45] PROBLEM - swift-container-auditor on ms-be1 is CRITICAL: NRPE: Command check_swift_container_auditor not defined
[10:15:45] PROBLEM - swift-account-replicator on ms-be1 is CRITICAL: NRPE: Command check_swift_account_replicator not defined
[10:15:54] PROBLEM - swift-account-reaper on ms-be1 is CRITICAL: NRPE: Command check_swift_account_reaper not defined
[10:15:54] PROBLEM - swift-container-updater on ms-be1 is CRITICAL: NRPE: Command check_swift_container_updater not defined
[10:16:01] (PS1) Springle: depool db1021 query storm/crash [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/88038
[10:16:04] PROBLEM - swift-object-updater on ms-be1 is CRITICAL: NRPE: Command check_swift_object_updater not defined
[10:16:14] PROBLEM - swift-container-server on ms-be1 is CRITICAL: NRPE: Command check_swift_container_server not defined
[10:16:17] (CR) Springle: [C: 2] depool db1021 query storm/crash [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/88038 (owner: Springle)
[10:16:24] PROBLEM - swift-account-auditor on ms-be4 is CRITICAL: NRPE: Command check_swift_account_auditor not defined
[10:16:33] blergh
[10:16:35] PROBLEM - swift-container-server on ms-be4 is CRITICAL: NRPE: Command check_swift_container_server not defined
[10:16:35] PROBLEM - swift-account-replicator on ms-be4 is CRITICAL: NRPE: Command check_swift_account_replicator not defined
[10:16:35] PROBLEM - swift-object-updater on ms-be4 is CRITICAL: NRPE: Command check_swift_object_updater not defined
[10:16:35] PROBLEM - swift-account-server on ms-be4 is CRITICAL: NRPE: Command check_swift_account_server not defined
[10:16:44] PROBLEM - swift-object-auditor on ms-be4 is CRITICAL: NRPE: Command check_swift_object_auditor not defined
[10:16:44] PROBLEM - swift-object-replicator on ms-be4 is CRITICAL: NRPE: Command check_swift_object_replicator not defined
[10:16:44] PROBLEM - swift-object-server on ms-be4 is CRITICAL: NRPE: Command check_swift_object_server not defined
[10:16:44] PROBLEM - swift-account-reaper on ms-be4 is CRITICAL: NRPE: Command check_swift_account_reaper not defined
[10:16:44] PROBLEM - swift-container-auditor on ms-be4 is CRITICAL: NRPE: Command check_swift_container_auditor not defined
[10:16:54] PROBLEM - swift-container-updater on ms-be4 is CRITICAL: NRPE: Command check_swift_container_updater not defined
[10:17:14] PROBLEM - swift-object-server on ms-be2 is CRITICAL: NRPE: Command check_swift_object_server not defined
[10:17:14] PROBLEM - swift-container-replicator on ms-be4 is CRITICAL: NRPE: Command check_swift_container_replicator not defined
[10:17:14] PROBLEM - swift-container-updater on ms-be2 is CRITICAL: NRPE: Command check_swift_container_updater not defined
[10:17:21] (CR) Springle: [V: 2] depool db1021 query storm/crash [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/88038 (owner: Springle)
[10:17:24] PROBLEM - swift-object-updater on ms-be2 is CRITICAL: NRPE: Command check_swift_object_updater not defined
[10:17:35] PROBLEM - swift-container-auditor on ms-be2 is CRITICAL: NRPE: Command check_swift_container_auditor not defined
[10:17:35] PROBLEM - swift-account-server on ms-be2 is CRITICAL: NRPE: Command check_swift_account_server not defined
[10:17:35] PROBLEM - swift-account-replicator on ms-be2 is CRITICAL: NRPE: Command check_swift_account_replicator not defined
[10:17:35] PROBLEM - swift-account-auditor on ms-be2 is CRITICAL: NRPE: Command check_swift_account_auditor not defined
[10:17:35] PROBLEM - swift-account-reaper on ms-be2 is CRITICAL: NRPE: Command check_swift_account_reaper not defined
[10:17:35] PROBLEM - swift-object-auditor on ms-be2 is CRITICAL: NRPE: Command check_swift_object_auditor not defined
[10:17:44] PROBLEM - swift-container-replicator on ms-be2 is CRITICAL: NRPE: Command check_swift_container_replicator not defined
[10:17:44] PROBLEM - swift-container-server on ms-be2 is CRITICAL: NRPE: Command check_swift_container_server not defined
[10:17:45] PROBLEM - swift-object-replicator on ms-be2 is CRITICAL: NRPE: Command check_swift_object_replicator not defined
[10:17:45] PROBLEM - swift-account-replicator on ms-be1012 is CRITICAL: NRPE: Command check_swift_account_replicator not defined
[10:17:54] PROBLEM - swift-container-updater on ms-be1012 is CRITICAL: NRPE: Command check_swift_container_updater not defined
[10:17:54] PROBLEM - swift-account-server on ms-be1012 is CRITICAL: NRPE: Command check_swift_account_server not defined
[10:18:04] PROBLEM - swift-container-server on ms-be1012 is CRITICAL: NRPE: Command check_swift_container_server not defined
[10:18:04] oh god, it's going to get worse
[10:18:04] PROBLEM - swift-object-server on ms-be1012 is CRITICAL: NRPE: Command check_swift_object_server not defined
[10:18:07] (CR) Ori.livneh: "> Is it actually non-critical to have a > 26h stale cache?" [operations/puppet] - https://gerrit.wikimedia.org/r/88009 (owner: Ori.livneh)
[10:18:11] !log springle synchronized wmf-config/db-eqiad.php 'depool db1021'
[10:18:14] PROBLEM - swift-object-updater on ms-be1012 is CRITICAL: NRPE: Command check_swift_object_updater not defined
[10:18:24] Logged the message, Master
[10:18:24] PROBLEM - swift-account-reaper on ms-be1012 is CRITICAL: NRPE: Command check_swift_account_reaper not defined
[10:18:34] PROBLEM - swift-object-auditor on ms-be1012 is CRITICAL: NRPE: Command check_swift_object_auditor not defined
[10:18:34] PROBLEM - swift-account-auditor on ms-be1012 is CRITICAL: NRPE: Command check_swift_account_auditor not defined
[10:18:35] PROBLEM - swift-object-replicator on ms-be1012 is CRITICAL: NRPE: Command check_swift_object_replicator not defined
[10:18:44] PROBLEM - swift-container-auditor on ms-be1012 is CRITICAL: NRPE: Command check_swift_container_auditor not defined
[10:18:44] PROBLEM - swift-container-replicator on ms-be1012 is CRITICAL: NRPE: Command check_swift_container_replicator not defined
[10:18:53] db1021 just got hit by a query storm, max connections, lock up. wtf
[10:19:31] maybe a side effect of swift being in a bad shape?
[10:20:05] no
[10:20:07] swift is fine
[10:20:22] an icinga check is broken
[10:20:32] I merged https://gerrit.wikimedia.org/r/87598 just before
[10:20:34] PROBLEM - swift-account-reaper on ms-be9 is CRITICAL: NRPE: Command check_swift_account_reaper not defined
[10:20:35] PROBLEM - swift-container-replicator on ms-be9 is CRITICAL: NRPE: Command check_swift_container_replicator not defined
[10:20:35] PROBLEM - swift-account-replicator on ms-be9 is CRITICAL: NRPE: Command check_swift_account_replicator not defined
[10:20:35] PROBLEM - swift-object-auditor on ms-be9 is CRITICAL: NRPE: Command check_swift_object_auditor not defined
[10:20:35] PROBLEM - swift-object-updater on ms-be9 is CRITICAL: NRPE: Command check_swift_object_updater not defined
[10:20:35] PROBLEM - swift-container-updater on ms-be3 is CRITICAL: NRPE: Command check_swift_container_updater not defined
[10:20:35] PROBLEM - swift-container-server on ms-be9 is CRITICAL: NRPE: Command check_swift_container_server not defined
[10:20:36] PROBLEM - swift-account-server on ms-be3 is CRITICAL: NRPE: Command check_swift_account_server not defined
[10:20:44] PROBLEM - swift-object-server on ms-be9 is CRITICAL: NRPE: Command check_swift_object_server not defined
[10:20:44] PROBLEM - swift-object-server on ms-be3 is CRITICAL: NRPE: Command check_swift_object_server not defined
[10:20:45] PROBLEM - swift-container-auditor on ms-be3 is CRITICAL: NRPE: Command check_swift_container_auditor not defined
[10:20:45] PROBLEM - swift-object-replicator on ms-be3 is CRITICAL: NRPE: Command check_swift_object_replicator not defined
[10:20:45] PROBLEM - swift-object-auditor on ms-be3 is CRITICAL: NRPE: Command check_swift_object_auditor not defined
[10:20:45] PROBLEM - swift-account-auditor on ms-be3 is CRITICAL: NRPE: Command check_swift_account_auditor not defined
[10:20:45] PROBLEM - swift-object-updater on ms-be3 is CRITICAL: NRPE: Command check_swift_object_updater not defined
[10:20:45] PROBLEM - swift-container-replicator on ms-be3 is CRITICAL: NRPE: Command check_swift_container_replicator not defined
[10:20:46] PROBLEM - swift-container-updater on ms-be9 is CRITICAL: NRPE: Command check_swift_container_updater not defined
[10:20:46] PROBLEM - swift-object-replicator on ms-be9 is CRITICAL: NRPE: Command check_swift_object_replicator not defined
[10:20:47] PROBLEM - swift-account-replicator on ms-be3 is CRITICAL: NRPE: Command check_swift_account_replicator not defined
[10:20:54] PROBLEM - swift-account-auditor on ms-be9 is CRITICAL: NRPE: Command check_swift_account_auditor not defined
[10:21:14] PROBLEM - swift-container-updater on ms-be11 is CRITICAL: NRPE: Command check_swift_container_updater not defined
[10:21:14] PROBLEM - swift-container-auditor on ms-be9 is CRITICAL: NRPE: Command check_swift_container_auditor not defined
[10:21:14] PROBLEM - swift-container-server on ms-be3 is CRITICAL: NRPE: Command check_swift_container_server not defined
[10:21:14] PROBLEM - swift-account-auditor on ms-be11 is CRITICAL: NRPE: Command check_swift_account_auditor not defined
[10:21:24] PROBLEM - swift-object-replicator on ms-be11 is CRITICAL: NRPE: Command check_swift_object_replicator not defined
[10:21:24] PROBLEM - swift-container-replicator on ms-be11 is CRITICAL: NRPE: Command check_swift_container_replicator not defined
[10:21:24] PROBLEM - swift-account-server on ms-be9 is CRITICAL: NRPE: Command check_swift_account_server not defined
[10:21:34] PROBLEM - swift-account-reaper on ms-be3 is CRITICAL: NRPE: Command check_swift_account_reaper not defined
[10:21:35] RECOVERY - mysqld processes on db1021 is OK: PROCS OK: 1 process with command name mysqld
[10:21:35] PROBLEM - swift-container-auditor on ms-be11 is CRITICAL: NRPE: Command check_swift_container_auditor not defined
[10:21:35] PROBLEM - swift-account-reaper on ms-be11 is CRITICAL: NRPE: Command check_swift_account_reaper not defined
[10:21:35] PROBLEM - swift-container-server on ms-be11 is CRITICAL: NRPE: Command check_swift_container_server not defined
[10:21:35] PROBLEM - swift-account-replicator on ms-be11 is CRITICAL: NRPE: Command check_swift_account_replicator not defined
[10:21:35] PROBLEM - swift-object-server on ms-be11 is CRITICAL: NRPE: Command check_swift_object_server not defined
[10:21:44] PROBLEM - swift-object-updater on ms-be11 is CRITICAL: NRPE: Command check_swift_object_updater not defined
[10:21:44] PROBLEM - swift-object-auditor on ms-be11 is CRITICAL: NRPE: Command check_swift_object_auditor not defined
[10:21:44] PROBLEM - swift-account-server on ms-be11 is CRITICAL: NRPE: Command check_swift_account_server not defined
[10:21:51] sorry for the spam
[10:22:14] RECOVERY - MySQL Processlist on db1021 is OK: OK 0 unauthenticated, 0 locked, 0 copy to table, 0 statistics
[10:22:26] grrrr - versus _
[10:22:29] I puppetd --disable everywhere now until I fix this
[10:22:32] let's see why now
[10:22:35] PROBLEM - swift-container-updater on ms-be12 is CRITICAL: NRPE: Command check_swift_container_updater not defined
[10:22:35] PROBLEM - swift-account-replicator on ms-be12 is CRITICAL: NRPE: Command check_swift_account_replicator not defined
[10:22:35] PROBLEM - swift-object-updater on ms-be12 is CRITICAL: NRPE: Command check_swift_object_updater not defined
[10:22:35] PROBLEM - swift-account-server on ms-be12 is CRITICAL: NRPE: Command check_swift_account_server not defined
[10:22:44] PROBLEM - swift-object-server on ms-be12 is CRITICAL: NRPE: Command check_swift_object_server not defined
[10:22:44] PROBLEM - swift-account-reaper on ms-be12 is CRITICAL: NRPE: Command check_swift_account_reaper not defined
[10:22:45] PROBLEM - swift-container-replicator on ms-be12 is CRITICAL: NRPE: Command check_swift_container_replicator not defined
[10:22:45] PROBLEM - swift-account-auditor on ms-be12 is CRITICAL: NRPE: Command check_swift_account_auditor not defined
[10:22:45] PROBLEM - swift-container-auditor on ms-be12 is CRITICAL: NRPE: Command check_swift_container_auditor not defined
[10:22:45] PROBLEM - swift-object-auditor on ms-be12 is CRITICAL: NRPE: Command check_swift_object_auditor not defined
[10:22:54] PROBLEM - swift-container-server on ms-be12 is CRITICAL: NRPE: Command check_swift_container_server not defined
[10:22:54] PROBLEM - swift-object-replicator on ms-be12 is CRITICAL: NRPE: Command check_swift_object_replicator not defined
[10:23:00] can we please get labstore3 rebooted ? RPC scheduler / NFS went wild again that kills beta + tools labs :-D
[10:23:05] paravoid: ^
[10:23:29] wasn't this fixed?
[10:23:36] apparently not
[10:23:36] it's happened again
[10:23:44] I can't find the RT ticket related to that
[10:23:45] RECOVERY - Apache HTTP on srv291 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.357 second response time
[10:23:47] happens like every 2 weeks
[10:23:58] would be awesome to have it stop happening
[10:24:04] heh!
[10:24:32] aude: iirc the plan is to migrate to another server.
[10:24:34] PROBLEM - swift-object-auditor on ms-be1007 is CRITICAL: NRPE: Command check_swift_object_auditor not defined
[10:24:35] PROBLEM - swift-container-server on ms-be1007 is CRITICAL: NRPE: Command check_swift_container_server not defined
[10:24:35] PROBLEM - swift-container-updater on ms-be1007 is CRITICAL: NRPE: Command check_swift_container_updater not defined
[10:24:35] PROBLEM - swift-object-server on ms-be1009 is CRITICAL: NRPE: Command check_swift_object_server not defined
[10:24:35] PROBLEM - swift-account-replicator on ms-be1007 is CRITICAL: NRPE: Command check_swift_account_replicator not defined
[10:24:35] PROBLEM - swift-object-replicator on ms-be1007 is CRITICAL: NRPE: Command check_swift_object_replicator not defined
[10:24:35] PROBLEM - swift-object-updater on ms-be1007 is CRITICAL: NRPE: Command check_swift_object_updater not defined
[10:24:36] PROBLEM - swift-container-replicator on ms-be1007 is CRITICAL: NRPE: Command check_swift_container_replicator not defined
[10:24:37] PROBLEM - swift-object-replicator on ms-be1009 is CRITICAL: NRPE: Command check_swift_object_replicator not defined [10:24:37] PROBLEM - swift-container-replicator on ms-be1009 is CRITICAL: NRPE: Command check_swift_container_replicator not defined [10:24:38] PROBLEM - swift-account-server on ms-be1009 is CRITICAL: NRPE: Command check_swift_account_server not defined [10:24:38] PROBLEM - MySQL Slave Delay on db1021 is CRITICAL: CRIT replication delay 386 seconds [10:24:39] PROBLEM - swift-account-reaper on ms-be1009 is CRITICAL: NRPE: Command check_swift_account_reaper not defined [10:24:39] PROBLEM - swift-account-auditor on ms-be1009 is CRITICAL: NRPE: Command check_swift_account_auditor not defined [10:24:39] hashar: ok [10:24:40] PROBLEM - swift-container-server on ms-be1009 is CRITICAL: NRPE: Command check_swift_container_server not defined [10:24:40] PROBLEM - swift-account-reaper on ms-be1007 is CRITICAL: NRPE: Command check_swift_account_reaper not defined [10:24:41] PROBLEM - swift-container-auditor on ms-be1007 is CRITICAL: NRPE: Command check_swift_container_auditor not defined [10:24:44] PROBLEM - swift-object-updater on ms-be1009 is CRITICAL: NRPE: Command check_swift_object_updater not defined [10:24:44] PROBLEM - MySQL Replication Heartbeat on db1021 is CRITICAL: CRIT replication delay 359 seconds [10:24:45] PROBLEM - swift-account-replicator on ms-be1009 is CRITICAL: NRPE: Command check_swift_account_replicator not defined [10:24:45] PROBLEM - swift-object-auditor on ms-be1009 is CRITICAL: NRPE: Command check_swift_object_auditor not defined [10:24:54] PROBLEM - swift-account-auditor on ms-be1007 is CRITICAL: NRPE: Command check_swift_account_auditor not defined [10:24:57] paravoid: it's a state thing i think. 
puppet hasn't yet run on neon [10:25:04] PROBLEM - swift-account-server on ms-be1007 is CRITICAL: NRPE: Command check_swift_account_server not defined [10:25:10] running it now [10:25:14] PROBLEM - swift-object-server on ms-be1007 is CRITICAL: NRPE: Command check_swift_object_server not defined [10:25:14] PROBLEM - swift-container-auditor on ms-be1009 is CRITICAL: NRPE: Command check_swift_container_auditor not defined [10:25:16] you mean the definitions are now check-swift-account-server [10:25:24] er, check_swift-account-server even [10:25:24] PROBLEM - swift-container-updater on ms-be1009 is CRITICAL: NRPE: Command check_swift_container_updater not defined [10:25:27] yes [10:25:30] right [10:25:44] RECOVERY - MySQL Replication Heartbeat on db1021 is OK: OK replication delay 106 seconds [10:25:56] !log rebooting labstore3, NFS lockup [10:26:30] :> [10:26:34] RECOVERY - MySQL Slave Delay on db1021 is OK: OK replication delay 0 seconds [10:26:52] paravoid: don't forget to run start-nfs afterwards [10:26:57] bah, the morebot is dead [10:26:59] I've always complained about how we run a 3.8 kernel there :) [10:27:08] morebots: ping [10:27:19] it will come back up after the nfs reboot [10:27:22] or should [10:27:24] PROBLEM - swift-account-replicator on ms-be10 is CRITICAL: NRPE: Command check_swift_account_replicator not defined [10:27:24] PROBLEM - swift-container-auditor on ms-be10 is CRITICAL: NRPE: Command check_swift_container_auditor not defined [10:27:27] ahh of course [10:27:35] PROBLEM - swift-object-auditor on ms-be10 is CRITICAL: NRPE: Command check_swift_object_auditor not defined [10:27:35] or it'll need a reboot, who knows [10:27:44] PROBLEM - swift-object-server on ms-be10 is CRITICAL: NRPE: Command check_swift_object_server not defined [10:27:44] PROBLEM - swift-account-auditor on ms-be10 is CRITICAL: NRPE: Command check_swift_account_auditor not defined [10:27:44] PROBLEM - swift-container-updater on ms-be10 is CRITICAL: NRPE: Command 
check_swift_container_updater not defined [10:27:44] PROBLEM - swift-object-replicator on ms-be10 is CRITICAL: NRPE: Command check_swift_object_replicator not defined [10:27:44] PROBLEM - swift-object-updater on ms-be10 is CRITICAL: NRPE: Command check_swift_object_updater not defined [10:27:45] PROBLEM - swift-account-server on ms-be10 is CRITICAL: NRPE: Command check_swift_account_server not defined [10:27:45] PROBLEM - swift-account-reaper on ms-be10 is CRITICAL: NRPE: Command check_swift_account_reaper not defined [10:27:45] PROBLEM - swift-container-replicator on ms-be10 is CRITICAL: NRPE: Command check_swift_container_replicator not defined [10:27:54] PROBLEM - swift-container-server on ms-be10 is CRITICAL: NRPE: Command check_swift_container_server not defined [10:28:24] PROBLEM - Host labstore3 is DOWN: PING CRITICAL - Packet loss = 100% [10:29:14] PROBLEM - swift-object-auditor on ms-be1002 is CRITICAL: NRPE: Command check_swift_object_auditor not defined [10:29:14] PROBLEM - swift-container-replicator on ms-be1002 is CRITICAL: NRPE: Command check_swift_container_replicator not defined [10:29:14] PROBLEM - swift-account-reaper on ms-be1004 is CRITICAL: NRPE: Command check_swift_account_reaper not defined [10:29:26] hashar: why can't we use the hostname instead of IP in the zuul/statsd changeset? 
https://gerrit.wikimedia.org/r/#/c/86744/ [10:29:34] PROBLEM - swift-container-replicator on ms-be1004 is CRITICAL: NRPE: Command check_swift_container_replicator not defined [10:29:34] PROBLEM - swift-account-auditor on ms-be1002 is CRITICAL: NRPE: Command check_swift_account_auditor not defined [10:29:35] PROBLEM - swift-account-reaper on ms-be1002 is CRITICAL: NRPE: Command check_swift_account_reaper not defined [10:29:35] PROBLEM - swift-container-server on ms-be1004 is CRITICAL: NRPE: Command check_swift_container_server not defined [10:29:35] PROBLEM - swift-container-auditor on ms-be1004 is CRITICAL: NRPE: Command check_swift_container_auditor not defined [10:29:35] PROBLEM - swift-account-server on ms-be1004 is CRITICAL: NRPE: Command check_swift_account_server not defined [10:29:35] PROBLEM - swift-object-server on ms-be1002 is CRITICAL: NRPE: Command check_swift_object_server not defined [10:29:35] PROBLEM - swift-object-replicator on ms-be1004 is CRITICAL: NRPE: Command check_swift_object_replicator not defined [10:29:36] PROBLEM - swift-object-updater on ms-be1002 is CRITICAL: NRPE: Command check_swift_object_updater not defined [10:29:36] it's so trivial I just want it merged ;-) [10:29:36] PROBLEM - swift-account-replicator on ms-be1004 is CRITICAL: NRPE: Command check_swift_account_replicator not defined [10:29:37] PROBLEM - swift-object-replicator on ms-be1002 is CRITICAL: NRPE: Command check_swift_object_replicator not defined [10:29:37] PROBLEM - swift-object-auditor on ms-be1004 is CRITICAL: NRPE: Command check_swift_object_auditor not defined [10:29:38] PROBLEM - swift-container-updater on ms-be1002 is CRITICAL: NRPE: Command check_swift_container_updater not defined [10:29:44] PROBLEM - swift-container-auditor on ms-be1002 is CRITICAL: NRPE: Command check_swift_container_auditor not defined [10:29:44] PROBLEM - swift-object-updater on ms-be1004 is CRITICAL: NRPE: Command check_swift_object_updater not defined [10:29:44] PROBLEM - 
swift-account-auditor on ms-be1004 is CRITICAL: NRPE: Command check_swift_account_auditor not defined [10:29:44] PROBLEM - swift-container-updater on ms-be1004 is CRITICAL: NRPE: Command check_swift_container_updater not defined [10:29:45] PROBLEM - swift-object-server on ms-be1004 is CRITICAL: NRPE: Command check_swift_object_server not defined [10:29:49] morebots, i thought we had something going. [10:29:54] PROBLEM - swift-account-replicator on ms-be1002 is CRITICAL: NRPE: Command check_swift_account_replicator not defined [10:29:54] PROBLEM - swift-account-server on ms-be1002 is CRITICAL: NRPE: Command check_swift_account_server not defined [10:30:04] PROBLEM - swift-container-server on ms-be1002 is CRITICAL: NRPE: Command check_swift_container_server not defined [10:30:09] paravoid: I am not sure how DNS works on a linux machine, I wanted to avoid a DNS lookup whenever Zuul sends a UDP metric. [10:30:14] PROBLEM - swift-account-auditor on ms-be1006 is CRITICAL: NRPE: Command check_swift_account_auditor not defined [10:30:14] PROBLEM - swift-container-updater on ms-be1006 is CRITICAL: NRPE: Command check_swift_container_updater not defined [10:30:14] PROBLEM - swift-object-updater on ms-be1006 is CRITICAL: NRPE: Command check_swift_object_updater not defined [10:30:24] PROBLEM - swift-object-auditor on ms-be1006 is CRITICAL: NRPE: Command check_swift_object_auditor not defined [10:30:24] PROBLEM - swift-object-replicator on ms-be1006 is CRITICAL: NRPE: Command check_swift_object_replicator not defined [10:30:34] PROBLEM - swift-container-replicator on ms-be1006 is CRITICAL: NRPE: Command check_swift_container_replicator not defined [10:30:34] PROBLEM - swift-account-server on ms-be1006 is CRITICAL: NRPE: Command check_swift_account_server not defined [10:30:35] PROBLEM - swift-account-reaper on ms-be1006 is CRITICAL: NRPE: Command check_swift_account_reaper not defined [10:30:35] PROBLEM - swift-container-auditor on ms-be1006 is CRITICAL: NRPE: Command 
check_swift_container_auditor not defined [10:30:44] PROBLEM - swift-container-server on ms-be1006 is CRITICAL: NRPE: Command check_swift_container_server not defined [10:30:54] PROBLEM - swift-account-replicator on ms-be1006 is CRITICAL: NRPE: Command check_swift_account_replicator not defined [10:30:54] PROBLEM - swift-object-server on ms-be1006 is CRITICAL: NRPE: Command check_swift_object_server not defined [10:31:02] grr [10:31:15] puppet is taking forever on neon [10:31:18] :-( [10:31:35] yes, it's naggen/puppet 2.7 -> thousands of SQL queries [10:31:39] hmmmm [10:31:46] among other delays [10:31:52] plus stafford being 100% [10:31:55] in addition to stafford [10:31:56] yes [10:32:05] what if i say puppetd --server=sockpuppet [10:32:22] at least one part of the problem will be eliminated [10:32:24] we could get varnish in front of stafford puppet master :D [10:32:50] lol [10:32:50] I don't think it works [10:32:54] it does [10:32:55] the sockpuppet idea [10:32:59] the varnish idea, I'm just going to ignore :P [10:33:08] though that needs a bunch of hacking :-D [10:33:19] why wouldn't the sockpuppet idea work ? [10:33:32] I think it's going to complain about certificates [10:33:41] another hack I have been told about is to have a puppetmaster on each server and simply rsync/git deploy the manifests to each server; they then ask their local puppetmaster instance [10:33:44] i think i can say --ca_server=stafford [10:33:49] hashar: doesn't work for us [10:33:50] !log springle synchronized wmf-config/db-eqiad.php 'repool db1021, slow warm up' [10:33:55] paravoid: yeah :/ [10:34:03] but varnish might be a possibility [10:34:13] and the usual idea is to run puppet apply, not a local puppetmaster [10:34:14] or find out whatever takes so long on stafford and optimize / tweak it [10:34:25] paravoid: labstore3 seems to be taking a while to come back up :/ [10:34:34] might be fscking ? [10:34:36] we have a plan, we just lack the time... 
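The "- versus _" problem behind the alert flood above (Icinga asks for check_swift_account_server while the host-side definition is apparently named check_swift-account-server) can be spotted mechanically. The following is only a sketch under assumptions: the function name and the two example sets are hypothetical, and it does not read any real NRPE or Icinga configuration.

```python
def find_separator_mismatches(requested, defined):
    """Map each requested command to a defined one that differs only in '-' vs '_'."""
    # Index the defined commands by a canonical form with every '-' turned into '_'.
    canonical = {name.replace("-", "_"): name for name in defined}
    mismatches = {}
    for name in requested:
        if name in defined:
            continue  # exact match, nothing to report
        candidate = canonical.get(name.replace("-", "_"))
        if candidate is not None:
            mismatches[name] = candidate
    return mismatches


# Hypothetical sample data modelled on the names quoted in the log.
requested = {"check_swift_account_server", "check_swift_object_auditor"}
defined = {"check_swift-account-server", "check_swift_object_auditor"}
print(find_separator_mismatches(requested, defined))
```

A command that appears in the result is requested under one spelling and defined under another, which is the situation that produces "Command ... not defined" until both sides are regenerated from the same manifests.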
[10:34:50] paravoid: hi, do you know if anyone will be able to help with enabling ESI? haven't heard anything over email or from mark, would be great if we can get it implemented and get rid of cache frag [10:34:51] (multiple boxes/workers) [10:35:00] true [10:35:47] addshore: looking... [10:35:55] yurik: I would go for mark to talk about ESI [10:36:06] oh https://wikitech.wikimedia.org/wiki/Puppet/Performance_investigation ( 2011 ) :-/ [10:36:45] RECOVERY - Host labstore3 is UP: PING OK - Packet loss = 0%, RTA = 26.56 ms [10:36:53] :> [10:36:58] \O/ [10:37:01] yurik_: I think you were working with him for this, didn't you? me stepping in will probably make things more complicated communications-wise, let's keep it simple [10:37:22] yurik_: considering how this isn't an emergency but a planned new feature [10:37:23] paravoid: true, just wasn't sure if he was around [10:37:44] or what his schedule is like [10:37:50] he's not on vacation if that's what you're asking :) [10:38:02] yurik_: and we can most probably have ESI enabled on beta cluster by using some evil hack in puppet :-] [10:38:11] addshore: isn't there a thing that needs to be restarted for labstore3 [10:38:14] ? [10:38:16] yurik_: something like: if $::realm == 'labs' --> $use_esi = true [10:38:22] something has to be poked [10:38:22] I'm doing start-nfs, yes [10:38:25] ok [10:38:31] hashar: that would be great - do you know if anyone can help with that? I don't think i have beta shell [10:38:48] yurik_: mark for the varnish vcl / puppet manifests. [10:38:53] that start-nfs thing is a pain [10:38:57] :/ [10:39:20] it gives me a big fat warning and asks me to confirm that the "other NFS server" is down [10:39:25] which one is the other one? :) [10:39:27] yurik_: the beta cluster uses the same configuration as production. The VCL has features that can be enabled/disabled by the puppet manifests. 
[10:39:29] aaaaah [10:39:33] !log springle synchronized wmf-config/db-eqiad.php 'repool db1021, slow warm up' [10:39:36] that is the question of the day [10:39:50] last time i supposed labstore4 [10:39:54] PROBLEM - swift-object-auditor on ms-be5 is CRITICAL: NRPE: Command check_swift_object_auditor not defined [10:39:54] aude: we will see, nfsd should come up by itself [10:39:54] PROBLEM - swift-container-auditor on ms-be5 is CRITICAL: NRPE: Command check_swift_container_auditor not defined [10:40:01] addshore: ok :) [10:40:04] hashar: yes, but i never worked with beta, not even sure if i can connect there [10:40:05] yurik_: so you would have to add an option in the VCL to let one enable/disable ESI. Whenever the puppet manifest is run on labs, you could then have the ESI feature enabled there. [10:40:07] maybe correctly, maybe incorrectly... but it worked [10:40:14] PROBLEM - swift-object-updater on ms-be5 is CRITICAL: NRPE: Command check_swift_object_updater not defined [10:40:14] PROBLEM - swift-container-replicator on ms-be5 is CRITICAL: NRPE: Command check_swift_container_replicator not defined [10:40:18] ok, done [10:40:31] hashar: but are we talking about beta cluster or lab instances? 
[10:40:34] PROBLEM - swift-account-reaper on ms-be5 is CRITICAL: NRPE: Command check_swift_account_reaper not defined [10:40:35] PROBLEM - swift-account-replicator on ms-be5 is CRITICAL: NRPE: Command check_swift_account_replicator not defined [10:40:35] PROBLEM - swift-object-server on ms-be5 is CRITICAL: NRPE: Command check_swift_object_server not defined [10:40:35] PROBLEM - swift-object-replicator on ms-be5 is CRITICAL: NRPE: Command check_swift_object_replicator not defined [10:40:35] PROBLEM - swift-account-auditor on ms-be5 is CRITICAL: NRPE: Command check_swift_account_auditor not defined [10:40:44] PROBLEM - swift-container-server on ms-be5 is CRITICAL: NRPE: Command check_swift_container_server not defined [10:40:44] PROBLEM - swift-account-server on ms-be5 is CRITICAL: NRPE: Command check_swift_account_server not defined [10:40:45] PROBLEM - swift-container-updater on ms-be5 is CRITICAL: NRPE: Command check_swift_container_updater not defined [10:40:51] problem is, it worked in my tests, but not in production [10:41:20] yurik_: the beta cluster is set up using labs instances :-] the project is deployment-prep and you are already a member / administrator of the project. [10:41:20] so i'm hoping betalabs, being similar enough to prod, will be able to spot the error [10:41:38] yurik_: exactly. It is as close to production as possible. [10:41:57] I think I brought up testing ESI in labs weeks ago [10:42:01] but you didn't like the idea :) [10:42:18] saying how certain you were that it'd work in production [10:42:54] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 30 seconds [10:42:56] yurik_: I gave you root access on deployment-prep so you can manually run puppet on those instances. [10:43:00] looks like nfsd just kick started itself :) [10:43:03] cheers paravoid [10:43:17] shouldn't take long for everything to catch up [10:43:26] addshore: paravoid yay! 
[10:43:32] my tools are back :) [10:43:41] I didn't do anything, just rebooted the box [10:43:44] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 19.195 second response time [10:43:45] yurik_: you can get a list of the instances having a public address at https://wikitech.wikimedia.org/wiki/Special:NovaAddress most of them are varnish front ends. [10:44:06] (CR) Ori.livneh: "(1 comment)" [operations/puppet] - https://gerrit.wikimedia.org/r/88009 (owner: Ori.livneh) [10:44:09] (PS2) Ori.livneh: Add Icinga check for l10nupdate & drop !log-based alerts [operations/puppet] - https://gerrit.wikimedia.org/r/88009 [10:44:10] (PS1) ArielGlenn: add mw1072 to dsh groups again since it is back in service [operations/puppet] - https://gerrit.wikimedia.org/r/88042 [10:44:14] (CR) Hashar: "Can we keep the !log entry ? Eng folks barely looks at the icinga spam but we surely look at wikitech logs :-)" [operations/puppet] - https://gerrit.wikimedia.org/r/88009 (owner: Ori.livneh) [10:44:19] (CR) Hashar: "I filled the IP address to prevent a DNS lookup whenever Zuul sends a metric. 
I have no clue how well the DNS resolver cache them nor for" [operations/puppet] - https://gerrit.wikimedia.org/r/86744 (owner: Hashar) [10:44:20] (PS1) Springle: repool db1021, warm up [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/88043 [10:44:21] (CR) Springle: [C: 2] repool db1021, warm up [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/88043 (owner: Springle) [10:44:22] (PS3) Nemo bis: Add Icinga check for l10nupdate & drop !log-based alerts [operations/puppet] - https://gerrit.wikimedia.org/r/88009 (owner: Ori.livneh) [10:44:24] (CR) ArielGlenn: [C: 2] add mw1072 to dsh groups again since it is back in service [operations/puppet] - https://gerrit.wikimedia.org/r/88042 (owner: ArielGlenn) [10:44:27] (CR) Faidon Liambotis: [C: 2] zuul: statsd sent to tungsten.eqiad.wmnet [operations/puppet] - https://gerrit.wikimedia.org/r/86744 (owner: Hashar) [10:44:29] hashar: thx! but could you babysit me through it in about an hour+? Need to step away for a sec, and i'm not that good with puppet runs (i have been manually setting things up) [10:44:30] yurik_: hence deployment-cache-text1.pmtpa.wmflabs is the text varnish and deployment-cache-mobile01.pmtpa.wmflabs is the one serving mobile traffic ( *.m.wikipedia.beta.wmflabs.org [10:45:16] akosiaris: I told ori-l on r88009 that we're moving icinga plugins under /usr/local/lib, is that accurate? [10:45:18] yurik_: I am heading out for lunch rather soonish [10:45:25] yes [10:45:27] hashar: same here - afterwards? [10:45:28] yurik_: will you be around later on? 
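On the hostname-versus-IP question for the Zuul statsd target: one way a client can use a hostname without paying for a lookup on every metric is to resolve once at startup and reuse the resulting socket address. This is a minimal sketch, assuming the plain StatsD counter wire format ("name:value|c" over UDP); the class name is hypothetical and this is not how Zuul's own statsd client is known to behave.

```python
import socket


class ResolveOnceStatsd:
    """Tiny UDP StatsD counter client that resolves its target a single time."""

    def __init__(self, host, port):
        # One getaddrinfo() call here; later sends reuse the numeric address.
        family, socktype, proto, _canon, sockaddr = socket.getaddrinfo(
            host, port, type=socket.SOCK_DGRAM
        )[0]
        self._addr = sockaddr
        self._sock = socket.socket(family, socktype, proto)

    def incr(self, name, count=1):
        # StatsD counter line: "<metric>:<value>|c"
        payload = f"{name}:{count}|c".encode("ascii")
        self._sock.sendto(payload, self._addr)


if __name__ == "__main__":
    # Hypothetical local target; nothing needs to be listening for UDP sends.
    client = ResolveOnceStatsd("127.0.0.1", 8125)
    client.incr("zuul.example.counter")
```

The trade-off is that a long-running process will not notice if the hostname later points elsewhere, so a restart (or a periodic re-resolve) is needed after the target moves.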
[10:45:35] he's asking me if I'm sure and I thought I should ask before I said "yes" :-) [10:45:44] (PS2) Nemo bis: [CleanChanges] Set $wgCCTrailerFilter to true [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/84113 [10:46:01] yurik_: as long as you are not supposed to be sleeping in a bed, we can talk about beta :-] [10:46:43] hashar: i'm in Ukraine :) [10:46:47] midday here [10:47:08] hashar: when's good for you? [10:47:18] in an hour and a half? [10:48:05] (CR) Faidon Liambotis: "(1 comment)" [operations/puppet] - https://gerrit.wikimedia.org/r/88009 (owner: Ori.livneh) [10:48:10] db1021 shows this: http://aerosuidae.net/paste/14/52529047 .. never seen so many inet6 lines. anyone know how/why that happens? [10:48:11] hashar: i'm basically trying to solve https://gerrit.wikimedia.org/r/#/q/owner:%22Mark+Bergsma+%253Cmark%2540wikimedia.org%253E%22+message:esi,n,z [10:48:32] lol [10:48:34] privacy extensions? [10:49:37] root@db1021:~# cat /proc/sys/net/ipv6/conf/eth0/use_tempaddr [10:49:37] 2 [10:49:59] root@db1021:~# ip addr ls |grep -c temporary [10:49:59] 7 [10:50:02] springle: ^ [10:50:24] paravoid: thanks [10:50:33] yurik_: ahhh we also have an instance to try out puppet/vcl changes : deployment-staging-cache-mobile01.pmtpa.wmflabs [10:50:39] * springle researches [10:50:50] (PS4) Ori.livneh: Add Icinga check for l10nupdate & drop !log-based alerts [operations/puppet] - https://gerrit.wikimedia.org/r/88009 [10:50:55] springle: it's basically an IPv6 "feature" useful for desktops and such [10:51:14] yurik_: that instance lets one apply a puppet manifest to it without having to merge it in operations/puppet.git . Then you can do some curl requests locally :-D i am pretty sure mark used it before. 
[10:51:19] springle: since normally the address is the /64 network + EUI-64 (= MAC address), there was the possibility of tracking you across networks [10:51:30] and knowing who a user is despite them moving into different networks [10:51:46] so they added a while back an extension to the spec where the OS generates random addresses and rotates through them [10:52:12] on Linux this is the use_tempaddr knob and if you look at "ip address list" you see a temporary flag [10:52:41] you'll also see a "deprecated" flag, which means these were temporary addresses that have been rotated and won't be used for new connections anymore [10:52:52] but they're kept in case you used them for some connections etc. [10:53:24] in practice, servers don't need privacy extensions and because they're producing funky outputs like the one above it's best to disable them [10:54:23] paravoid: ok, thank you [10:54:40] I just run "echo 0 > /proc/sys/net/ipv6/conf/eth0/use_tempaddr" [10:54:45] (03CR) 10Ori.livneh: "Hashar: keeping this a special case in the SAL doesn't help bring us forward toward having consistent and well-tempered monitoring in my o" [operations/puppet] - 10https://gerrit.wikimedia.org/r/88009 (owner: 10Ori.livneh) [10:54:45] the output should be much saner now :) [10:54:54] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 30 seconds [10:55:16] will that stick across a reboot? [10:55:34] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 7.424 second response time [10:55:59] Linux's default is 0, I think Ubuntu defaults to having an entry under /etc/sysctl.d to enable it but we override this [10:56:12] I think the box was last rebooted before this was in place though [10:56:19] hence this effect [10:56:45] ah ok [10:56:54] can I run an apt-get dist-upgrade on the box? 
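The "/64 network + EUI-64 (= MAC address)" construction described above can be worked through by hand: split the 48-bit MAC in half, insert ff:fe in the middle, flip the universal/local bit of the first octet, and append the resulting 64 bits to the prefix. A short sketch of that derivation follows; the function name, the documentation prefix 2001:db8::/64, and the MAC address are illustrative assumptions, not values from db1021.

```python
import ipaddress


def eui64_address(prefix, mac):
    """Build the modified EUI-64 IPv6 address for a MAC inside a /64 prefix."""
    network = ipaddress.IPv6Network(prefix)
    if network.prefixlen != 64:
        raise ValueError("modified EUI-64 addresses need a /64 prefix")
    octets = bytearray(int(part, 16) for part in mac.split(":"))
    if len(octets) != 6:
        raise ValueError("expected a 48-bit MAC address")
    octets[0] ^= 0x02  # flip the universal/local bit
    # Interface identifier: first three octets, ff:fe, last three octets.
    interface_id = bytes(octets[:3]) + b"\xff\xfe" + bytes(octets[3:])
    return ipaddress.IPv6Address(
        int(network.network_address) + int.from_bytes(interface_id, "big")
    )


# Hypothetical example values (documentation prefix, made-up MAC).
print(eui64_address("2001:db8::/64", "00:11:22:33:44:55"))
```

Because the lower 64 bits stay the same on every network the interface joins, the address is trackable, which is the motivation for the temporary addresses controlled by use_tempaddr; on a server with a fixed location that concern does not apply.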
[10:57:01] it says 106 updates :) [10:58:12] paravoid: maybe we should just add a cron to reboot that box every 2 weeks just before it usually die ? ;p [10:58:19] *dies [10:58:38] heh [10:58:41] i think the plan is to migrate the NFS service to labstore4 [10:58:46] but could not find any reference :-( [10:58:47] addshore: "you can't fire me, i quit!" [10:59:31] ori-l: I think there's a more fundamental problem with your change [10:59:47] paravoid: it's still in the pool. i'm investigating why it seems to be hammered while other slaves in S5 aren't. don't dist upgrade just yet [10:59:53] what is it? [11:00:14] there's a class of problems that need to be looked after by deployers and icinga is mostly used by ops :/ [11:01:20] throughout its history the SAL was a record of manual actions taken by real human beings [11:01:21] I wholeheartedly agree that more monitoring (and less crazy bot-to-bot communication) in that area is good [11:01:28] which makes it a very valuable document [11:01:32] (CR) Raimond Spekking: "I am looking every day for the status of the LU script because I run the l10n bot scripts on translatewiki.net on a (most) daily base." [operations/puppet] - https://gerrit.wikimedia.org/r/88009 (owner: Ori.livneh) [11:01:56] i think the 'localization update complete' spam doesn't belong there [11:01:57] I just think we have to find a way to make this without having it fall through the cracks [11:02:01] maybe a different contact group? 
[11:02:12] yeah, I don't know; I see the problem [11:02:19] I just don't think keeping it in the SAL is a good option [11:02:26] no disagreement there [11:02:44] !log gallium : restarting Zuul to enable statsd reporting {{gerrit|86744}} [11:02:54] maybe a nagios dashboard [11:02:54] Logged the message, Master [11:03:05] with only relevant errors [11:03:06] dunno [11:03:06] !log gallium : killed a bunch of stalled jenkins/java threads [11:03:12] I say we keep it open for a bit and get input from more people about how they use it; Raimond chimed in just now [11:03:17] yep [11:03:19] Logged the message, Master [11:03:21] maybe even discuss it in today's meeting [11:03:34] or just bring attention to it [11:03:36] tomorrow's! [11:03:38] * ori-l is in denial [11:04:11] ok, i should go to sleep. see ya [11:04:18] bye [11:06:12] (CR) Hashar: "--- /etc/default/zuul 2013-09-10 00:24:52.000000000 +0000" [operations/puppet] - https://gerrit.wikimedia.org/r/86744 (owner: Hashar) [11:09:18] notice: Finished catalog run in 921.05 seconds [11:09:20] poor gallium [11:09:24] (CR) Ori.livneh: "We just discussed this on IRC and there seems to be consensus about keeping this patch open and getting more feedback from folks about how" [operations/puppet] - https://gerrit.wikimedia.org/r/88009 (owner: Ori.livneh) [11:16:14] PROBLEM - Puppet freshness on ms-fe1001 is CRITICAL: No successful Puppet run in the last 10 hours [11:25:13] !log gallium : zuul restarted and apparently starting statsd metrics. 
[11:25:24] Logged the message, Master [11:30:43] !log new public mailing list for the Wikimania Steering Committee - wikimania-com [11:30:53] Logged the message, Master [12:01:06] (PS1) Faidon Liambotis: swift: swift-ganglia-report-global-stats for eqiad [operations/puppet] - https://gerrit.wikimedia.org/r/88056 [12:01:07] (PS1) Faidon Liambotis: swift: inline swift::proxy::config [operations/puppet] - https://gerrit.wikimedia.org/r/88057 [12:01:08] (PS1) Faidon Liambotis: swift: add statsd support to proxy-server [operations/puppet] - https://gerrit.wikimedia.org/r/88058 [12:02:34] (CR) jenkins-bot: [V: -1] swift: inline swift::proxy::config [operations/puppet] - https://gerrit.wikimedia.org/r/88057 (owner: Faidon Liambotis) [12:02:39] damn [12:03:13] (CR) jenkins-bot: [V: -1] swift: add statsd support to proxy-server [operations/puppet] - https://gerrit.wikimedia.org/r/88058 (owner: Faidon Liambotis) [12:03:43] paravoid: i've been asked to document where (app) servers need to be removed when they go down for broken hw, like disk replacing. somehow adding that into "Server_Lifecycle" page .. it's an optional event in a server's life :p [12:03:55] though it's not a decom [12:04:04] that page is now RobH anyway... 
[12:04:09] ah, ok [12:04:24] I don't think it's used for anything [12:04:27] too verbose, too manual [12:05:15] hmm, it's a long read [12:05:48] (PS2) Faidon Liambotis: swift: inline swift::proxy::config [operations/puppet] - https://gerrit.wikimedia.org/r/88057 [12:05:49] (PS2) Faidon Liambotis: swift: add statsd support to proxy-server [operations/puppet] - https://gerrit.wikimedia.org/r/88058 [12:06:38] (CR) jenkins-bot: [V: -1] swift: inline swift::proxy::config [operations/puppet] - https://gerrit.wikimedia.org/r/88057 (owner: Faidon Liambotis) [12:06:53] yep, adding the info which dsh groups are the important ones and telling Rob [12:07:15] (CR) jenkins-bot: [V: -1] swift: add statsd support to proxy-server [operations/puppet] - https://gerrit.wikimedia.org/r/88058 (owner: Faidon Liambotis) [12:08:25] just involve cmjohnson I think [12:08:32] it's not like RobH is swapping hardware anymore :) [12:09:56] (PS3) Faidon Liambotis: swift: inline swift::proxy::config [operations/puppet] - https://gerrit.wikimedia.org/r/88057 [12:09:57] (PS3) Faidon Liambotis: swift: add statsd support to proxy-server [operations/puppet] - https://gerrit.wikimedia.org/r/88058 [12:10:47] true [12:13:47] (CR) Faidon Liambotis: [C: 2] swift: swift-ganglia-report-global-stats for eqiad [operations/puppet] - https://gerrit.wikimedia.org/r/88056 (owner: Faidon Liambotis) [12:14:00] (CR) Faidon Liambotis: [C: 2] swift: inline swift::proxy::config [operations/puppet] - https://gerrit.wikimedia.org/r/88057 (owner: Faidon Liambotis) [12:14:12] (CR) Faidon Liambotis: [C: 2] swift: add statsd support to proxy-server [operations/puppet] - https://gerrit.wikimedia.org/r/88058 (owner: Faidon Liambotis) [12:18:09] (CR) Hashar: "(2 comments)" [operations/puppet] - https://gerrit.wikimedia.org/r/85669 (owner: Hashar) [12:20:43] (PS2) Hashar: ganglia wrapper for py plugins (and add diskstat plugin) 
[operations/puppet] - https://gerrit.wikimedia.org/r/85669 [12:21:20] (PS1) Faidon Liambotis: swift: fixup to statsd commit [operations/puppet] - https://gerrit.wikimedia.org/r/88060 [12:21:50] (CR) Faidon Liambotis: [C: 2 V: 2] swift: fixup to statsd commit [operations/puppet] - https://gerrit.wikimedia.org/r/88060 (owner: Faidon Liambotis) [12:24:57] paravoid: Icinga currently down due to config error, because nrpe_check_swift_ check commands are in services but not defined [12:25:45] did you want them gone [12:26:06] akosiaris is handling it [12:26:09] he mailed ops about it too :) [12:26:28] oops, ehem, i should read mail ok:) [12:26:34] no worries :) [12:31:09] * paravoid stabs stabs stabs puppet [12:31:33] I think we've passed the point of no return [12:31:42] puppet can't run anymore [12:33:55] is it normal that there are so many "salt-master" processes on sockpuppet [12:34:04] no idea [12:37:02] sigh @ puppet .. you mean on neon especially, right [12:37:08] ganglia [12:37:31] arg, nickel [12:37:36] I mean in general... [12:37:47] !log gallium / Zuul : cherry picked a change from OpenStack related to statsd metrics. Our change: {{gerrit|88063}} [12:37:57] stafford in 100% CPU [12:38:03] Logged the message, Master [12:39:17] !log gallium : restarting Zuul. [12:39:28] Logged the message, Master [12:43:07] (CR) Faidon Liambotis: "Agreed on removing them from SAL (bot-to-bot communication always felt crazy to me). 
However, we need to find a way to bring these to the " [operations/puppet] - https://gerrit.wikimedia.org/r/88009 (owner: Ori.livneh) [12:52:56] (PS1) Faidon Liambotis: swift: fix python syntax error in ganglia stats [operations/puppet] - https://gerrit.wikimedia.org/r/88066 [12:53:13] (CR) Faidon Liambotis: [C: 2 V: 2] swift: fix python syntax error in ganglia stats [operations/puppet] - https://gerrit.wikimedia.org/r/88066 (owner: Faidon Liambotis) [12:54:42] (PS1) Dzahn: delete dsh group "broken_appservers". want to reduce number of useless groups that are edited each time a server is taken down or added back. [operations/puppet] - https://gerrit.wikimedia.org/r/88068 [12:56:57] RECOVERY - swift-container-server on ms-be1012 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server [12:56:57] RECOVERY - swift-account-server on ms-be1007 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server [12:56:57] RECOVERY - swift-container-server on ms-be1002 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server [12:56:57] RECOVERY - swift-object-server on ms-be1012 is OK: PROCS OK: 101 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server [12:56:57] RECOVERY - swift-account-auditor on ms-be8 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-auditor [12:57:03] yay [12:57:40] !log brought icinga back up, after manually running puppet on neon [12:57:40] :) [12:57:53] Logged the message, Master [12:58:24] ottomata: hey, how's openjdk? [12:58:55] (PS1) Dzahn: this was once created most likely by me when using upgrade-helper script to create dsh groups from parsing nagios config as the reliable source. but it doesn't make sense to keep an outdated version in git. 
[operations/puppet] - https://gerrit.wikimedia.org/r/88069 [12:59:07] RECOVERY - Puppet freshness on ms-fe1001 is OK: puppet ran at Mon Oct 7 12:59:04 UTC 2013 [12:59:21] mutante: http://www.mediawiki.org/wiki/Gerrit/Commit_message_guidelines :) [12:59:37] PROBLEM - Apache HTTP on mw72 is CRITICAL: Connection refused [13:00:11] paravoid: heh, ok, ok:) no more than _50_? ooh [13:01:10] first line a short description, then empty line, then long description [13:01:53] yea, i just forgot the first line, amending [13:02:18] (PS2) Dzahn: delete dsh group 'nagios' [operations/puppet] - https://gerrit.wikimedia.org/r/88069 [13:03:34] hashar: ping [13:03:46] (PS2) Dzahn: delete dsh group "broken_appservers" [operations/puppet] - https://gerrit.wikimedia.org/r/88068 [13:15:40] (CR) Akosiaris: "Since there seems to be a consensus that HTTPS is indeed offering something here and it is not costing us something I will be dropping thi" [operations/puppet] - https://gerrit.wikimedia.org/r/84873 (owner: Akosiaris) [13:15:46] (Abandoned) Akosiaris: Disable HTTPS on etherpad.wikimedia.org [operations/puppet] - https://gerrit.wikimedia.org/r/84873 (owner: Akosiaris) [13:23:02] (PS1) Dzahn: delete dsh group "mediawiki-installation-precise" [operations/puppet] - https://gerrit.wikimedia.org/r/88071 [13:23:03] (PS1) Odder: (bug 53904) Point local sidebars to UploadWizard on Commons [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/88072 [13:30:51] hashar: around? [13:31:08] yurik_: yup [13:31:11] yei [13:31:14] have a sec? [13:31:21] yeah yeah sure [13:31:25] forgot about you, sorry :( [13:31:31] i knew it! [13:32:00] * yurik_ status is "severely depressed" [13:32:20] could you walk me through customizing beta with a custom puppet [13:32:22] pls [13:34:04] yurik_: sure [13:34:22] yurik_: is ESI for mobile caches ? 
[13:34:27] yep [13:35:06] so the instance deployment-staging-cache-mobile01.pmtpa.wmflabs is configured with the role::cache::mobile puppet class [13:35:11] it is not serving any traffic [13:35:34] and it has a puppetmaster::self class applied to it [13:35:42] which mean the instance can receive puppet hacks :-] [13:36:08] https://wikitech.wikimedia.org/wiki/Help:Self-hosted_puppetmaster [13:36:11] that is the long doc [13:36:22] basically that means you can fetch a Gerrit patch in /var/lib/git/operations/puppet [13:36:37] RECOVERY - Apache HTTP on mw72 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.749 second response time [13:36:40] and whenever you run puppet (using: puppetd -tv) it will run that state of operations/puppet [13:36:52] basically that let you hack a manifest in labs [13:37:01] hashar: so if i need to push https://gerrit.wikimedia.org/r/#/c/87328/1/templates/varnish/mobile-frontend.inc.vcl.erb [13:37:08] then since the instance has varnish being applied to it, you can use some curl request to test out your change [13:37:39] you would ssh to deployment-staging-cache-mobile01.pmtpa.wmflabs [13:37:42] become root: sudo -s [13:37:57] then fetch the patch in /var/lib/git/operations/puppet using the command listed in Gerrit gui [13:38:23] something like: git fetch https://gerrit.wikimedia.org/r/operations/puppet refs/changes/28/87328/1 && git checkout FETCH_HEAD [13:38:28] I am upgrading the instance [13:39:17] thanks, connecting to it... 
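Editor's sketch, not part of the log: the fetch command quoted above follows Gerrit's change-ref layout, refs/changes/<last two digits of the change number>/<change number>/<patchset>. The Python helper below builds that ref and the matching one-liner for change 87328, patchset 1. The function names gerrit_change_ref and fetch_command are made up for illustration, the zero-padding of short change numbers is assumed from Gerrit's usual convention, and the command is only printed, never executed.

```python
def gerrit_change_ref(change: int, patchset: int) -> str:
    """Build the Gerrit ref for one patchset of a change (hypothetical helper)."""
    if change <= 0 or patchset <= 0:
        raise ValueError("change and patchset must be positive integers")
    # Gerrit shards refs by the last two digits of the change number.
    shard = f"{change % 100:02d}"
    return f"refs/changes/{shard}/{change}/{patchset}"


def fetch_command(repo_url: str, change: int, patchset: int) -> str:
    """Return the shell one-liner that fetches and checks out the patchset."""
    ref = gerrit_change_ref(change, patchset)
    return f"git fetch {repo_url} {ref} && git checkout FETCH_HEAD"


if __name__ == "__main__":
    # Repository URL and change number are the ones mentioned in the log.
    print(fetch_command("https://gerrit.wikimedia.org/r/operations/puppet", 87328, 1))
```

On a self-hosted puppetmaster instance the printed command would be run as root inside /var/lib/git/operations/puppet, followed by puppetd -tv, as described in the walkthrough.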
[13:39:36] I am upgrading the instance right now [13:39:46] but yeah come there [13:40:26] varnish still running \O/ [13:40:54] mark has some patch there from 10 weeks ago [13:40:56] apparently [13:41:05] will rebase [13:42:49] resetted it to latest version [13:44:45] yurik_: fwiw, I'm working on a varnish role for vagrant :) https://gerrit.wikimedia.org/r/#/c/87623/ [13:44:47] wip [13:44:48] hashar: cann't ssh :( [13:45:51] YuviPanda: yep, saw that, haven't tested it - as long as i can choose which server to access ( apache or varnish ) via different ports, its all good. The only real question which one should be 80 [13:46:04] yurik_: well, with vagrant nothing's on 80 :) [13:46:10] yurik_: do you have configured your ssh client to use proxy commands ? [13:46:14] yurik_: varnish via 8080 and apache via 8081 [13:46:18] Could not find class passwords::puppet::database [13:46:19] bah [13:46:20] hashar: ssh gets pubkey error from bastion: ssh deployment-staging-cache-mobile01.pmtpa.wmflabs [13:46:36] hashar: but I can do ssh api1 [13:46:42] which is a different instance [13:47:02] what is your username on labs? [13:47:05] yurik [13:47:22] it confirms RSA key fingerprint and rejects [13:47:57] ahh [13:48:06] you have no home :-] [13:48:13] thanks!!! [13:48:17] that's harsh! [13:51:18] (CR) Ottomata: "Faidon, I don't like this either, but if we have to turn off varnishncsa while still supporting downstream consumers of udp2log mobile tra" [operations/puppet] - https://gerrit.wikimedia.org/r/86894 (owner: Ottomata) [13:51:40] which are these consumers, ottomata? [13:51:41] hashar: do i need to do something about home? [13:52:09] oh the downstream consumers [13:52:17] yurikNomad: Failed publickey for yurik from ** port 45190 ssh2 [13:52:28] i see that is actually a good question! 
[13:52:30] ** being bastion1.pmtpa.wmflabs [13:52:34] i will ask drdee [13:52:48] yurikNomad: lets move to #wikimedia-labs :-] [13:53:21] paravoid, one I know of right now is uhh…us :p [13:53:39] and we'd like to be able to keep generating mobile stats (out of hadoop) as a baseline for new data we generate [13:53:49] to see how we are doing, making sure that the numbers make sense [13:55:18] paravoid: erik zachte uses them for stats.wikimedia.org [13:55:36] http://stats.wikimedia.org/EN/TablesPageViewsMonthlyMobile.htm [13:56:52] hm, paravoid, we could run a separate udp2log instance for mobile data and have this hack send to it, rather than the full firehose [13:57:49] I think my point is [13:57:55] this is a proposed temporary solution [13:58:04] I'd like to know a) what exactly does it solve, b) how temporary is it [13:58:12] so we can also think about alternative temporary solutions :) [13:58:40] ok, here is a specific one for erik zachte [13:59:06] wikistats is a large perl codebase developed by erik zachte over many years to ingest webrequest udp2log log files and generate stats.wikimedia.org [13:59:34] who's going to do the udp2log -> kafka conversion for these consumers? [13:59:54] for wikistats, I presume erik; has anyone told him of the plans/asked him? :) [13:59:58] this patch does kafka -> udp2log [14:00:07] I know [14:00:08] why would we need udp2log -> kafka? [14:00:21] no, I meant adding kafka support to wikistats [14:00:24] isn't this the plan? [14:00:25] oh hah [14:00:26] no [14:00:27] uhhhhh [14:00:33] we will port wikistats to hadoop [14:00:38] what /is/ the plan? :) [14:00:46] that is the plan [14:00:50] okay [14:00:58] when approx.? 
[14:01:07] the problem is that we need to support legacy services while we transition [14:01:11] we can't just turn them off [14:01:21] mark wants us to turn varnishncsa off when we start up varnishkafka [14:01:26] we want to start working on that this fall as soon as we have hired the new backend engineer [14:01:40] I'm saying that this transition is very vague in terms of work needed/time it will take [14:01:42] i said there can be a transition period [14:01:46] oh! [14:01:49] with both on? [14:01:52] but not one that takes many months [14:01:55] hm [14:02:00] and we shouldn't consider temporary solutions unless we know how long this will take [14:02:01] it will probably take many months [14:02:18] agree with ottomata [14:02:19] doesn't need to be exact, but it needs to be loosely defined in terms of days/weeks/months/years [14:02:28] full port of wikistats to hadoop [14:02:29] ha [14:02:34] uh [14:02:35] not full port [14:02:43] only key performance indicators [14:02:58] besides wikistats, do we have other consumers? [14:03:08] the question is can we rely work on it full time (the port) or are we directed to do other thing [14:03:09] s [14:03:22] s/rely/really/ [14:03:27] PROBLEM - Host labstore4 is DOWN: PING CRITICAL - Packet loss = 100% [14:03:36] Coren: that you? [14:04:17] RECOVERY - Host labstore4 is UP: PING OK - Packet loss = 0%, RTA = 26.56 ms [14:05:03] if we would add page view counts for the mobile site at the article level in webstatscollector then that would be the 2nd consumer [14:05:28] that it is something that also gets more urgent as mobile traffic keeps growing [14:07:43] is that also going to be migrated to hadoop and is that going to happen after the backend engineer hire? [14:09:27] paravoid: Yes; sorry about the noise: since labstore4 was removed from decomissioned I forgot that'd make icinga noisy again. 
[14:09:44] k [14:10:09] it wasn't about the noise, it was about if I should do something about it :) [14:10:33] (just !log to alert others) [14:14:01] paravoid: Shouldn't the right thing to do to manage the alerts through nagios? I was about to look into putting the server in scheduled maintenance mode. [14:14:40] that works too, yes [14:14:50] but isn't a substitute for !log I think [14:15:12] logging gives the intention as well (rebooting, kernel upgrade) [14:16:19] Ah, interestingly enough I would have thought the opposite (put intent in nagios so that it can be looked up there without having to dig through logs) :-) [14:16:33] I'll just do both. [14:16:46] * Coren fails to locate credentials for Icinga. [14:17:49] it's labs :) [14:18:10] labstore3/4 aren't labs; they're "real". :-) [14:18:46] no, the credentials [14:18:57] PROBLEM - NTP on labstore4 is CRITICAL: NTP CRITICAL: Offset unknown [14:22:57] RECOVERY - NTP on labstore4 is OK: NTP OK: Offset -0.001353621483 secs [14:23:12] just use your wikitech username/password for icinga-admin [14:23:26] we used to call this "labs credentials", although I guess it's becoming less and less true [14:26:58] then you can search for hostname and there is a one-click "Schedule downtime for this host and all services" if you wanted to do it beforehand, (from the host page, host commands) or .. from a "service status details" page, you can use a checkbox to mark all service and from drop-down "acknowledege checked service problems" if you want to ack it after the fact [14:28:13] both should stop IRC bot output and further notifications and can stay for timeperiod until they expire or be sticky .. afaik [14:28:54] !log Labstore4 Configuring; will bounce up and down over the next two days. [14:29:06] Logged the message, Master [14:29:13] paravoid: Noted for future reference. I used the cheap way and talked straight to icinga through the command pipe. 
:-) [14:29:40] or that :) https://wikitech.wikimedia.org/wiki/Nagios#Scheduling_downtimes_with_a_shell_command [14:30:43] Oh FUCK ME! [14:30:51] I just rebooted neon. [14:31:01] awww [14:31:05] lol [14:31:07] ?? [14:31:11] * Coren hides in shame. [14:31:29] how ?... just typed reboot in the wrong prompt ? [14:31:31] heh [14:31:45] upgrade kernel and do it twice?:) [14:32:05] sounds like a plan :-) [14:32:23] mutante: No, just brain damage. As akosiaris guessed, I was still logged in neon when I said "Welp, it's in maintenance mode now so I can reboot it [labstore4]" [14:32:23] (PS1) Ebrahim: Updating translation of Persian [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/88108 [14:32:44] hey paravoid, unrelated to previous conversation [14:32:59] This is why I normally dislike having to log into boxen with root keys. [14:32:59] i'm trying to build an updated version of librdkafka, and it looks like I have to mess with the debian/librdkafka.symbols file [14:33:19] At least neon goes back up really quickly. :-0 [14:34:30] i wonder if we should install molly-guard [14:35:17] akosiaris: We almost certainly should. If only because that'd protect prod from /me/ :-) [14:35:25] * Coren likes molly-guard [14:35:25] (PS2) Ebrahim: Updating translation of Persian [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/88108 [14:36:02] i like molly-guard too... I hated it at first... but then... is circumventable if you indeed know what you are doing and protects from these kind of errors [14:36:53] akosiaris: And of the worse error of doing a 'poweroff' on a remote system you do not have a RAC/KVM for. :-) [14:38:01] :-) [14:38:30] been there, on a friday night, and the location was accesible "on Monday" [14:38:58] mutante: Of /course/ it had to be a friday night. 
:-) [14:39:16] Coren: it happens when the mouse focus is over another terminal but you think your local laptop shut "shutdown -h now" [14:39:39] since then no mouse focus change without click [14:40:06] I'm old skool X; I use "focus follows mouse" [14:40:14] oh man [14:40:26] (CR) Ebrahim: "Reviewer can look at the old version http://tools.wmflabs.org/ebraminio-dev/php-fatal-error_old.html and the new version http://tools.wmfl" [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/88108 (owner: Ebrahim) [14:40:35] well if it's in eqiad you can always call me [14:40:37] i keep moving my mouse pointer out of the currently focused window [14:41:02] cmjohnson1: We were discussion nightmare scenarios, not what happened now. :-) Thankfully. :-) [14:48:06] (CR) Reza: [C: 1] "it is ok" [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/88108 (owner: Ebrahim) [14:50:51] !log Restarting Parsoid on wtp10[01-24] on request; load avg reaching 90% [14:51:05] Logged the message, Mr. Obvious [14:51:32] (CR) Akosiaris: [C: -1] "(1 comment)" [operations/puppet] - https://gerrit.wikimedia.org/r/88009 (owner: Ori.livneh) [14:53:42] Oh, FFS. Why does Dell always do everything more complicated than it needs to be? [14:54:23] Stupid shelves won't do JBOD right. [14:56:39] akosiaris: .deb question for you [14:56:49] i need to generate an updated symbols file for librdkafka [14:56:56] (PS1) Dzahn: delete zwinger and zwinger2 from wmnet [operations/dns] - https://gerrit.wikimedia.org/r/88113 [14:57:01] is it ok to just take the output of the dpkg-symbols command and use it? [14:57:11] baseically replace the old .symbols file with the new one? 
[14:57:13] I did this: [14:57:19] dpkg-gensymbols -plibrdkafka1 -Pdebian/librdkafka1 -edebian/librdkafka1/usr/lib/librdkafka.so.1 -O/tmp/symbols [14:57:30] cp /tmp/symboles debian/librdkafka1.symbols [14:57:33] and then the package built fine [14:58:50] (CR) Ottomata: [C: 2 V: 2] Updated for new librdkafka API. [operations/software/varnish/varnishkafka] - https://gerrit.wikimedia.org/r/87472 (owner: Edenhill) [15:12:10] (PS1) ArielGlenn: access to analytics boxes for Halfaker (rt 5836), Tarborelli (rt 5835) [operations/puppet] - https://gerrit.wikimedia.org/r/88116 [15:13:20] ottomata: ^^ anything else needed for them for those boxes? [15:13:36] ottomata: lemme see [15:13:38] I specifically didn't add them to the group cause not team members (are they?) but maybe itns' needed [15:13:49] *it's [15:17:04] that should do it apergos, thank you [15:17:15] ok doing it now [15:17:16] ottomata: what's the diff between the two, ottomata? [15:17:24] er, once even :) [15:17:32] the symbol diff [15:17:40] hmmm [15:17:41] (CR) ArielGlenn: [C: 2] access to analytics boxes for Halfaker (rt 5836), Tarborelli (rt 5835) [operations/puppet] - https://gerrit.wikimedia.org/r/88116 (owner: ArielGlenn) [15:17:43] err [15:19:32] paravoid: https://gist.github.com/ottomata/6869713 [15:19:54] (PS1) Dzahn: remove references to old servers like zwinger [operations/dns] - https://gerrit.wikimedia.org/r/88120 [15:20:42] remove the -1ubuntu1 [15:20:53] and remove the MISSING lines [15:21:17] and make sure to tell Snaps about the missing lines [15:22:39] (CR) Faidon Liambotis: "They're not ns0/1/2 for sure for several years, but do they even exist anymore? wikimedia.org/wmnet still have zwinger & pascal." 
[operations/dns] - https://gerrit.wikimedia.org/r/88120 (owner: Dzahn) [15:24:08] (CR) Dzahn: "i was going to remove zwinger in: Change-Id: I13c571c36b070d2c7610a9547e342a3a2bf20060" [operations/dns] - https://gerrit.wikimedia.org/r/88120 (owner: Dzahn) [15:37:09] (CR) Dzahn: "regarding pascal: it's ntp.esams.wikimedia.org but 100% packet loss from an esams host, and appservers there just use dobson and linne in " [operations/dns] - https://gerrit.wikimedia.org/r/88120 (owner: Dzahn) [15:40:14] (PS8) Krinkle: Enable VisualEditor on "phase 2" Wikipedias (all users) [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/84370 (owner: Jforrester) [16:03:02] (PS9) Krinkle: Enable VisualEditor on "phase 2" Wikipedias (anons) [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/84370 (owner: Jforrester) [16:03:27] (CR) Krinkle: [C: 2] Enable VisualEditor on "phase 2" Wikipedias (anons) [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/84370 (owner: Jforrester) [16:03:38] (Merged) jenkins-bot: Enable VisualEditor on "phase 2" Wikipedias (anons) [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/84370 (owner: Jforrester) [16:03:44] (CR) Krinkle: "Rebased." [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/84370 (owner: Jforrester) [16:05:24] !log krinkle synchronized wmf-config/InitialiseSettings.php 'I59e2547002bac5' [16:05:42] Logged the message, Master [16:06:01] This is the first time sync-file finished without a single error from a node that is not available [16:06:10] Is something broken :P ? 
[16:15:16] fixed the one remaining issue today but [16:15:39] I reserve the right to have reservations :-P [16:16:43] (PS1) Dzahn: add dsh group "named-servers" [operations/puppet] - https://gerrit.wikimedia.org/r/88126 [16:18:11] (PS2) Dzahn: add dsh group "named-servers" [operations/puppet] - https://gerrit.wikimedia.org/r/88126 [16:27:53] (PS1) Chad: Updates for 2.7-rc2-507-g1e7090b [operations/debs/gerrit] - https://gerrit.wikimedia.org/r/88129 [16:29:12] (CR) Chad: [V: 2] "war available: https://gerrit.wikimedia.org/gerrit-2.7-rc2-507-g1e7090b.war" [operations/debs/gerrit] - https://gerrit.wikimedia.org/r/88129 (owner: Chad) [16:44:43] Ryan_Lane: hello [16:45:10] do you know whether wiki* are blocked in IR on ipv6? [16:45:56] and it might be useful to make wmgHTTPSBlacklistCountries separated by ipv4/v6 [16:48:45] liangent: no clue [16:49:48] paravoid: ping [17:09:53] (PS1) coren: Labs NFS: Minor fixes to maintain-replicas [operations/software] - https://gerrit.wikimedia.org/r/88143 [17:10:20] (CR) coren: [C: 2 V: 2] "Reflects status-quo (live version)" [operations/software] - https://gerrit.wikimedia.org/r/88143 (owner: coren) [17:13:24] (PS1) coren: Labs DB: restore copyright notice/license [operations/software] - https://gerrit.wikimedia.org/r/88144 [17:14:07] (PS2) coren: Labs DB: restore copyright notice/license [operations/software] - https://gerrit.wikimedia.org/r/88144 [17:14:39] (CR) coren: [C: 2 V: 2] "+license" [operations/software] - https://gerrit.wikimedia.org/r/88144 (owner: coren) [17:15:52] (CR) CSteipp: "We definitely want this in beta in the future, since I've found several bugs that happen only when wgSecureLogin is enabled. 
But if it's c" [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/87045 (owner: Mattflaschen) [17:25:17] (PS1) Dzahn: rm williams from site.pp,use iodine for otrs mail [operations/puppet] - https://gerrit.wikimedia.org/r/88145 [17:30:25] (PS1) Faidon Liambotis: Fix mgmt IP clash between nas1-a/b & ms-fe1/2 [operations/dns] - https://gerrit.wikimedia.org/r/88146 [17:30:51] (PS1) Dzahn: remove "williams" from DNS, RT #5908 [operations/dns] - https://gerrit.wikimedia.org/r/88147 [17:31:04] heh [17:32:29] (PS2) Dzahn: rm williams from site.pp,use iodine for otrs mail [operations/puppet] - https://gerrit.wikimedia.org/r/88145 [17:32:46] (CR) RobH: "williams is also in icinga monitoring, so this patchset should include decommission.pp addition as well." [operations/puppet] - https://gerrit.wikimedia.org/r/88145 (owner: Dzahn) [17:35:54] (CR) Chad: "I put together the changes for the debian package, just need it built and put on apt.wm.o." [operations/puppet] - https://gerrit.wikimedia.org/r/84743 (owner: QChris) [17:36:28] (PS3) Dzahn: rm williams from site.pp,use iodine for otrs mail [operations/puppet] - https://gerrit.wikimedia.org/r/88145 [17:39:51] (PS1) coren: Labs DB: Add labsdb-side second line of defense [operations/software] - https://gerrit.wikimedia.org/r/88149 [17:41:06] (PS2) coren: Labs DB: Add labsdb-side second line of defense [operations/software] - https://gerrit.wikimedia.org/r/88149 [17:41:45] (CR) Faidon Liambotis: [C: 2] Fix mgmt IP clash between nas1-a/b & ms-fe1/2 [operations/dns] - https://gerrit.wikimedia.org/r/88146 (owner: Faidon Liambotis) [17:42:02] mutante: shall I merge yours too? 
[17:42:18] (PS3) coren: Labs DB: Add labsdb-side second line of defense [operations/software] - https://gerrit.wikimedia.org/r/88149 [17:42:40] no, it still pings [17:42:59] (CR) Dzahn: "i need to know if i can add this for a use group that is not a real chapter yet" [operations/apache-config] - https://gerrit.wikimedia.org/r/86652 (owner: Dzahn) [17:43:11] paravoid: which one ? [17:43:34] williams [17:43:54] no wait, unless you know that using iodine in that exim template is right [17:43:56] !log changed ms-fe1/2's mgmt IPs, ip clash with nas1-a/b's e0M [17:44:06] I didn't because the box still pings [17:44:12] Logged the message, Master [17:44:15] was about to wait for OTRS migration person [17:44:47] https://gerrit.wikimedia.org/r/#/c/88145/ [17:45:15] remove from DNS after remove from puppet [17:45:51] (CR) Jgreen: [C: 1 V: 1] "looks good" [operations/puppet] - https://gerrit.wikimedia.org/r/88145 (owner: Dzahn) [17:47:18] eh :) now you can if you like to also merge the puppet one or Jeff, but i was about to run out the door so cant watch it [17:49:56] mutante: did you do dns patchset? [17:50:25] RobH: yes https://gerrit.wikimedia.org/r/#/c/88147/ [17:50:32] out,bbl [17:50:48] cya, i'll snag the dns and merge post removal from icinga, etc... [17:51:23] thx:) [17:53:56] (PS4) RobH: rm williams from site.pp,use iodine for otrs mail [operations/puppet] - https://gerrit.wikimedia.org/r/88145 (owner: Dzahn) [17:56:03] (CR) RobH: [C: 2] rm williams from site.pp,use iodine for otrs mail [operations/puppet] - https://gerrit.wikimedia.org/r/88145 (owner: Dzahn) [18:07:26] (PS4) coren: Labs DB: Add labsdb-side second line of defense [operations/software] - https://gerrit.wikimedia.org/r/88149 [18:17:27] !log reedy synchronized php-1.22wmf20/extensions/Wikibase [18:17:40] Logged the message, Master [18:22:09] hrmmm, no Coren in #-tech [18:22:27] Yet another channel to join? Eeeu. 
[18:23:59] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: Everything non 'pedia to 1.22wmf20 [18:24:11] Logged the message, Master [18:26:07] (PS1) Reedy: Everything non 'pedia to 1.22wmf20 [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/88153 [18:27:38] (CR) Reedy: [C: 2] Everything non 'pedia to 1.22wmf20 [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/88153 (owner: Reedy) [18:27:48] (Merged) jenkins-bot: Everything non 'pedia to 1.22wmf20 [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/88153 (owner: Reedy) [18:38:05] !log backporting python-docopt 0.6.1 for precise and including in our apt repo [18:38:16] Logged the message, Master [18:39:57] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: commonswiki back to 1.22wmf19 [18:40:07] Logged the message, Master [18:44:50] Is there a way to add default messages matching a userright in InitialiseSettings.php, or is it expected that the wiki's communities will add them? [18:45:50] I.e. provide a default for MediaWiki:Group-foo and friends? [18:46:26] PROBLEM - Memcached on virt0 is CRITICAL: Connection refused [18:49:13] They should usually be added by core/whatever extension... [18:49:19] And then translated at translatewiki [18:49:28] Some exceptions go into WikimediaMessages [18:51:16] Reedy: This is a project-specific userright, so I'm guessing it really doesn't want to be in core. [18:51:28] Which project? [18:51:31] enwiki [18:52:03] WikimediaMessages then [18:52:12] which is an extension [18:52:19] * Coren nods. This makes exactly 0.97 senses. [18:53:08] There's already stupid things such as: [18:53:09] 'group-Ex_Administrator' => 'Ex administrators', [18:53:09] 'group-Ex_Administrator-member' => '{{GENDER:$1|ex administrator}}', [18:53:09] 'grouppage-Ex_Administrator' => '{{ns:project}}:Ex administrators', [18:55:41] "Ex administrator"? 
o_O [18:55:41] !log reedy synchronized php-1.22wmf20/thumb.php [18:55:53] Logged the message, Master [18:56:30] (PS3) Ebrahim: Updating translation of Persian [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/88108 [18:56:48] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: commonswiki back to 1.22wmf20, thumb.php fixed [18:57:03] Logged the message, Master [18:57:16] :) [18:59:19] (CR) Reedy: "If a (native) speaker of Persian can confirm that this translation is ok, I'll then deploy it." [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/88108 (owner: Ebrahim) [19:00:02] (PS6) Reedy: Allow Commons admins self-adding translationadmin group [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/86366 (owner: Rillke) [19:00:08] (CR) Reedy: [C: 2] Allow Commons admins self-adding translationadmin group [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/86366 (owner: Rillke) [19:00:19] (Merged) jenkins-bot: Allow Commons admins self-adding translationadmin group [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/86366 (owner: Rillke) [19:00:27] (PS4) Reedy: Set Europe/Minsk TZ for bewiki [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/86379 (owner: Wizardist) [19:00:32] (CR) Reedy: [C: 2] Set Europe/Minsk TZ for bewiki [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/86379 (owner: Wizardist) [19:00:46] (Merged) jenkins-bot: Set Europe/Minsk TZ for bewiki [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/86379 (owner: Wizardist) [19:01:08] (PS1) Odder: (bug 4883) Edit $wgSiteName and set up NS aliases for ukwikinews [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/88188 [19:02:05] (PS3) Reedy: Enable subpages in Programs namespace of metawiki [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/86414 (owner: TTO) [19:02:12] (CR) Reedy: [C: 
2] Enable subpages in Programs namespace of metawiki [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/86414 (owner: TTO) [19:02:19] Reedy: Ah, WikimediaMessages has all the group- fun, but none of the userrights- stuff. [19:02:35] (that I can see) [19:03:52] Hmm [19:04:16] Which userright? [19:04:33] (Merged) jenkins-bot: Enable subpages in Programs namespace of metawiki [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/86414 (owner: TTO) [19:04:34] (PS3) Reedy: Change SUL image for loginwiki to WMF logo [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/86091 (owner: TTO) [19:04:34] (CR) Reedy: [C: 2] Change SUL image for loginwiki to WMF logo [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/86091 (owner: TTO) [19:04:35] (Merged) jenkins-bot: Change SUL image for loginwiki to WMF logo [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/86091 (owner: TTO) [19:04:35] (PS3) Reedy: Set up rollbacker and filemover groups on hiwiki [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/85961 (owner: TTO) [19:04:39] (CR) Reedy: [C: 2] Set up rollbacker and filemover groups on hiwiki [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/85961 (owner: TTO) [19:04:52] (Merged) jenkins-bot: Set up rollbacker and filemover groups on hiwiki [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/85961 (owner: TTO) [19:05:09] (PS3) Reedy: Set logo for ukwikisource per community request [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/85838 (owner: TTO) [19:05:20] (CR) Reedy: [C: 2] Set logo for ukwikisource per community request [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/85838 (owner: TTO) [19:05:39] (Merged) jenkins-bot: Set logo for ukwikisource per community request [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/85838 (owner: TTO) [19:05:46] (CR) 
Reedy: "There's not been any mass changes or anything. The rebases were trivial, so possibly just jgit related fail" [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/85838 (owner: TTO) [19:06:49] (PS4) Reedy: Wnable $wgUseRCPatrol on fawiki [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/86093 (owner: TTO) [19:06:53] (PS5) Reedy: Wnable $wgUseRCPatrol on fawiki [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/86093 (owner: TTO) [19:07:09] (CR) Reedy: [C: 2] "You reverted my commit message fix!" [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/86093 (owner: TTO) [19:07:21] (Merged) jenkins-bot: Enable $wgUseRCPatrol on fawiki [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/86093 (owner: TTO) [19:07:44] (PS3) Reedy: (bug 54229) Add autopatrolled user group on ukwikivoyage [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/87275 (owner: Odder) [19:08:03] (CR) Reedy: [C: 2] (bug 54229) Add autopatrolled user group on ukwikivoyage [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/87275 (owner: Odder) [19:08:11] (Merged) jenkins-bot: (bug 54229) Add autopatrolled user group on ukwikivoyage [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/87275 (owner: Odder) [19:08:22] (PS3) Reedy: (bug 54922) Add an accountcreator user group on svwiki [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/87432 (owner: Odder) [19:08:29] (CR) Reedy: [C: 2] (bug 54922) Add an accountcreator user group on svwiki [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/87432 (owner: Odder) [19:08:41] (Merged) jenkins-bot: (bug 54922) Add an accountcreator user group on svwiki [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/87432 (owner: Odder) [19:08:55] (PS2) Reedy: Remove EditPageTracking extension [operations/mediawiki-config] - 
10https://gerrit.wikimedia.org/r/87896 (owner: 10Ori.livneh) [19:08:59] (03CR) 10Reedy: [C: 032] Remove EditPageTracking extension [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/87896 (owner: 10Ori.livneh) [19:09:16] (03Merged) 10jenkins-bot: Remove EditPageTracking extension [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/87896 (owner: 10Ori.livneh) [19:09:19] Reedy: ohhhhh no [19:09:35] DarTar just sent me an e-mail last night saying they were thinking of keeping it [19:10:59] (03PS2) 10Reedy: (bug 53904) Point local sidebars to UploadWizard on Commons [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/88072 (owner: 10Odder) [19:11:03] (03CR) 10Reedy: [C: 032] (bug 53904) Point local sidebars to UploadWizard on Commons [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/88072 (owner: 10Odder) [19:11:14] (03Merged) 10jenkins-bot: (bug 53904) Point local sidebars to UploadWizard on Commons [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/88072 (owner: 10Odder) [19:11:43] (03PS2) 10Reedy: (bug 4883) Edit $wgSiteName and set up NS aliases for ukwikinews [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/88188 (owner: 10Odder) [19:11:48] (03CR) 10Reedy: [C: 032] (bug 4883) Edit $wgSiteName and set up NS aliases for ukwikinews [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/88188 (owner: 10Odder) [19:12:00] (03Merged) 10jenkins-bot: (bug 4883) Edit $wgSiteName and set up NS aliases for ukwikinews [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/88188 (owner: 10Odder) [19:12:01] ori-l: you should have -1'd the patch... 
[19:12:13] Nemo_bis: yes, I know [19:12:20] but DarTar is saying it's OK [19:12:27] so crisis averted [19:12:34] ^ Reedy [19:12:47] (03PS5) 10Reedy: $wgCaptchaWhitelist: whitelist also links with query or anchor [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/83225 (owner: 10Umherirrender) [19:12:52] (03CR) 10Reedy: [C: 032] $wgCaptchaWhitelist: whitelist also links with query or anchor [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/83225 (owner: 10Umherirrender) [19:12:53] Coren: WikimediaMessages only has non-standard groups; I doubt we have non-standard *permissions* [19:13:03] (03Merged) 10jenkins-bot: $wgCaptchaWhitelist: whitelist also links with query or anchor [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/83225 (owner: 10Umherirrender) [19:13:04] ori-l: oh, nice :) [19:13:29] Nemo_bis: Looks like it. Not like it can't be added to the Mediawiki: ns by the community anyways. [19:13:41] Coren: so what [19:14:17] Well yeah, that's my point. "No worries". [19:14:26] ok [19:14:40] it's customisation vs. translation [19:14:44] It'd have been nice to have had suitable defaults in place, but it's not a concern if they aren't there when it can be solved by about two edits. 
:-) [19:15:05] defaults are supposed to be suitable [19:15:06] (03PS3) 10Reedy: Set up patroller user group on frwikivoyage [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/85644 (owner: 10TTO) [19:15:21] (03CR) 10Reedy: [C: 032] Set up patroller user group on frwikivoyage [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/85644 (owner: 10TTO) [19:15:29] (03Merged) 10jenkins-bot: Set up patroller user group on frwikivoyage [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/85644 (owner: 10TTO) [19:15:48] PROBLEM - Puppet freshness on williams is CRITICAL: No successful Puppet run in the last 10 hours [19:16:21] (03PS3) 10Reedy: Allow crats on outreachwiki to revoke translationadmin group [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/85513 (owner: 10TTO) [19:16:33] (03CR) 10Reedy: [C: 032] Allow crats on outreachwiki to revoke translationadmin group [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/85513 (owner: 10TTO) [19:16:45] (03Merged) 10jenkins-bot: Allow crats on outreachwiki to revoke translationadmin group [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/85513 (owner: 10TTO) [19:17:06] oh, found the conversation above; I'm confused by en.wiki having local-only permissions O_o [19:18:12] Importing Karrotter,_ett_par_med_rechaud-Deep_-_Hallwylska_museet_-_30774.tif...Redis server error: protocol error, got '�' as reply-type byte [19:19:23] sob [19:22:09] (03PS4) 10Ebrahim: Updating translation of Persian [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/88108 [19:22:25] interesting, it was uploaded just 9 min ago, it had a thumb and purging removed it urrecoverably [19:24:15] and a new purge worked [19:25:11] (03CR) 10Mardetanha: "Native speaker comment: Translation is OK, you may deploy" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/88108 (owner: 10Ebrahim) [19:25:57] Erm. git review -D really doesn't work the way I'd expect. 
[19:28:13] PROBLEM - search indices - check lucene status page on search1015 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern found - 53503 bytes in 0.025 second response time [19:28:13] PROBLEM - search indices - check lucene status page on search1019 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern found - 59253 bytes in 0.014 second response time [19:28:13] PROBLEM - Disk space on cp1058 is CRITICAL: DISK CRITICAL - free space: /srv/sda3 3623 MB (1% inode=99%): /srv/sdb3 3615 MB (1% inode=99%): [19:28:13] PROBLEM - HTTPS on amssq47 is CRITICAL: Connection refused [19:32:49] Hm. [19:33:04] twkozlowski: it's not sync'ed yet, if it's your question [19:33:28] assuming logmsgbot is not a muted liar [19:34:00] Nemo_bis: I was like... '[remote rejected] git: change 88188 closed.' => me: WTF... [19:34:09] ^^ [19:34:18] then I looked and was like o_0 [19:34:23] did you want to amend something? [19:34:55] I left a whitespace at the end of a line [19:34:59] -,-'' [19:35:35] !log reedy synchronized wmf-config/ [19:35:35] you should get a barnstar "managed to get trailing waitspace pass the gerrit review and be merged" [19:35:49] Logged the message, Master [19:36:03] RECOVERY - Puppet freshness on williams is OK: puppet ran at Mon Oct 7 19:36:00 UTC 2013 [19:36:56] twkozlowski: weird, for once sidebar caching is not acting weird :D [19:37:04] PROBLEM - Puppet freshness on williams is CRITICAL: No successful Puppet run in the last 10 hours [19:37:57] (03PS1) 10coren: Add templateeditor right, group, and restriction [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/88196 [19:38:29] (03CR) 10coren: [C: 04-2] "Not intended for submission yet." [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/88196 (owner: 10coren) [19:41:13] Nemo_bis: so it works? :) wee [19:45:59] (03PS1) 10Ori.livneh: Prefer WikimediaEvents to CoreEvents, now that the extension has been renamed. 
[operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/88199 [19:46:36] ori-l: heya [19:46:48] all swift in both pmtpa & eqiad are pushing stats to statsd fwiw [19:46:51] it's a bit... overwhelming [19:46:59] maybe I need to adjust the sampling rate [19:48:02] about 500 udp per second per box by my very rought measurement :) [19:48:11] 450 maybe [19:51:49] (03CR) 10coren: [C: 031] "This is adequate, though I still have my reservations about the proliferations of files in roles/*" [operations/puppet] - 10https://gerrit.wikimedia.org/r/84926 (owner: 10Yuvipanda) [19:52:51] (03CR) 10coren: [C: 032] "-dev packages should not be deployed to the execution environment, but to dev_environ instead." [operations/puppet] - 10https://gerrit.wikimedia.org/r/84288 (owner: 10DrTrigon) [19:53:08] (03CR) 10coren: [C: 04-2] "That was meant to be a -2" [operations/puppet] - 10https://gerrit.wikimedia.org/r/84288 (owner: 10DrTrigon) [19:58:16] paravoid: uh, yeah, that sounds like a lot [20:03:22] Hm, does a self -1 prohibit the Gerrit bot link to a patch on Bugzilla like Coren did with https://gerrit.wikimedia.org/r/#/c/88196/ ? [20:05:12] Thehelpfulone: @ [20:05:43] !log LocalisationUpdate failed: git pull of extensions failed [20:05:57] Logged the message, Master [20:06:05] twkozlowski: I was wondering that myself. [20:06:57] :) [20:10:05] (03PS1) 10Andrew Bogott: Remove the trivial class base::mwclient [operations/puppet] - 10https://gerrit.wikimedia.org/r/88214 [20:10:16] uh l10nupdate fail? [20:10:27] It's not scheduled run time [20:10:31] Testing something ori noticed [20:10:34] PROBLEM - Puppet freshness on williams is CRITICAL: No successful Puppet run in the last 10 hours [20:11:48] ah [20:11:51] I was wondering [20:21:15] hmmm, localisation failed? [20:21:29] see above. 
[20:21:37] yeah [20:23:24] aude: l10nupdate barfs whenever an extension updates its reference to an externally-hosted submodule [20:23:43] i wonder which is to blame [20:23:45] i brought it up in #mediawiki_security because i worried it was a security issue, but it isn't [20:23:45] !log LocalisationUpdate completed (1.22wmf20) at Mon Oct 7 20:23:45 UTC 2013 [20:23:51] lololol [20:23:58] :) [20:23:58] Logged the message, Master [20:24:08] Reedy: hrm? [20:25:30] so we have #mediawiki_security now? [20:25:47] srsly [20:25:51] * Platonides mumbles about channel proliferation [20:26:03] Platonides: bribe a GC to kill it [20:26:07] also, underscore?! [20:26:26] the underscore seems to break all our channel convenrions [20:26:31] *conventions [20:26:52] Registered : Apr 06 18:36:11 2005 [20:26:57] hah [20:27:21] I think we had some underscores still in 2009 or so but they got killed? [20:27:29] <^d> OH NO NOT UNDERSCORES! [20:33:53] !log LocalisationUpdate completed (1.22wmf19) at Mon Oct 7 20:33:53 UTC 2013 [20:34:04] Logged the message, Master [20:36:55] RECOVERY - Puppet freshness on williams is OK: puppet ran at Mon Oct 7 20:36:50 UTC 2013 [20:37:34] PROBLEM - Puppet freshness on williams is CRITICAL: No successful Puppet run in the last 10 hours [20:46:05] !log LocalisationUpdate ResourceLoader cache refresh completed at Mon Oct 7 20:46:05 UTC 2013 [20:46:18] Logged the message, Master [20:49:07] hasharCall: :-) [20:53:48] paravoid: Are you on wikitech-l? [20:55:02] I am [20:55:25] cool. I just sent an email there announcing the draft RFC [20:55:31] I saw :) [20:55:38] thanks for rewording my comments [20:55:40] much better [20:56:35] Oh thanks. 
Glad I didn't make them worse [21:06:02] RECOVERY - Puppet freshness on williams is OK: puppet ran at Mon Oct 7 21:05:59 UTC 2013 [21:06:43] PROBLEM - Puppet freshness on williams is CRITICAL: No successful Puppet run in the last 10 hours [21:25:15] (03PS1) 10Edenhill: Use LRU hash for logline cache to avoid memory leak [operations/software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/88234 [21:25:54] bblack, you around? [21:27:58] bblack++ [21:31:39] dr0ptp4kt: I think the more important question that is left a bit unanswered in this chain of emails is: does anything else except *.{m,zero}.wp.org count? [21:31:47] bits/upload? [21:36:01] RECOVERY - Puppet freshness on williams is OK: puppet ran at Mon Oct 7 21:35:53 UTC 2013 [21:36:01] PROBLEM - Puppet freshness on williams is CRITICAL: No successful Puppet run in the last 10 hours [21:38:40] paravoid, i prefer that the definition of a hit is inherited from the general mobile web definition. i don't know if bits/upload are counted for the general mobile web definition? [21:39:26] dunno [21:39:28] paravoid, the plan is going to be as carriers are switched over to ip address-whitelisting to get banners going on all projects. granted, wikipedia gets the lion's share of hits. [21:39:40] that is not the issue [21:40:54] mobile is a separate varnish cluster and has been the target thus far [21:41:09] from e.g. varnishkafka deployments to zero tagging [21:42:16] paravoid, yeah, i think i get it. the first qualifier for something to be a mobile web hit is a domain of [.]m..org, correct? [21:42:33] you'd have to ask analytics for that :) [21:42:43] but I'm interested in knowing the answer too [21:42:54] drdee_: ? [21:42:55] paravoid, i should say, the cluster is represented by that pattern, right? 
[21:43:34] oh [21:43:37] paravoid, not sure if https://raw.github.com/wikimedia/metrics/master/pageviews/new_mobile_pageviews_report/pageview_definition.png is the latest in the state of the art, but i'm gujessing it probably is [21:43:38] that, plus zero [21:43:41] * drdee_ is reading [21:43:58] .{m,zero}..org [21:44:14] dr0ptp4kt: that's the wrong link [21:44:16] drdee_ i think paravoid's first message to me was a reply to my latest email :) [21:44:39] plus without , or with "www" instead of lang, or .mobile. etc., but these are all redirects [21:44:43] "dr0ptp4kt: paravoid, i prefer that the definition of a hit is inherited from the general mobile web definition." [21:44:51] yes that's correct [21:45:10] the zero starting point is the endpoint of the regular mobile page view definition [21:45:13] does the generic mobile web definition include API, bits, upload? [21:45:29] it does include API [21:45:36] it excludes bits and upload (IIRC) [21:45:42] okay [21:45:42] but these are living documents :) [21:46:03] drdee_, thx. yeah :) [21:46:39] so the plan is for analytics to post-process the logs to do carrier-detection as I proposed in one of the bugs? [21:46:44] is that correct? [21:47:26] that seems more like a question for a mailnglist :) [21:47:35] paravoid, drdee_, (+cc yurik and yurik_) … not sure about technical implementation ... [21:48:01] ok, I wasn't sure if this is something that has been agreed or not [21:48:28] i think we are moving towards that direction but i would like to have more public conversation about it [21:52:30] paravoid, drdee (cc yurik yurk_) yeah it's worth thinking through on email, i think. seems like the X-CS header needs to be there for identifying carrier sourced traffic and varying the cache, but as to how things are counted, i'm sort of indifferent on that. 
one upside of logging the X-CS value in the varnish logs is that, as long as the configuratons are polled periodically via the netmapper utilities to ensure the latest ip addresses f [21:52:30] carriers, opera, etc. are accurate - the X-CS gives a good idea of the number of hits without having to figure out what the ip addresses were at the time the hit registered. [21:53:10] it's more complicated than that [21:53:11] paravoid, drdee (cc yurik yurik_ ), but it's also obviously in principal to look at the history of configs for any reconciliation [21:53:43] I want the "zero languages per carrier" in VCL to be gone [21:54:25] paravoid, you referring to the notion of only X languages being supported by some carriers? [21:54:30] I am [21:57:01] paravoid, yeah, if the X-CS header isn't present, that's "one" variant of the header upon which general clients get the bannerless page, so it wouldn't cause cache object proliferation. that is to say, if ZeroRatedMobileAccess bails out and doesn't add the Vary: header for X-CS it shouldn't be problematic. but then i see where you're going with reconciliation of hits later on. [21:57:05] bblack, you back? [21:57:11] yes [21:57:21] I'm ok with reconciliation [21:57:41] bblack, right on. i actually need to step away from my desk but will be back in about 2.5 minutes. cc paravoid. [21:57:42] I'm just saying, Varnish by itself isn't sufficient for analytics [22:00:27] bblack, paravoid, drdee, back. putting headphones on. headbop in 3, 2, 1... [22:00:33] for? [22:00:59] paravoid, helps the brain. [22:01:02] ok... [22:05:12] bblack, the netmapper stuff for opera is contingent on the first part of tag_carrier in https://git.wikimedia.org/raw/operations%2Fpuppet.git/production/templates%2Fvarnish%2Fzero.inc.vcl.erb only, right? or is there anything else? paravoid, do you know if there's anything else going on with opera besides that first part of zero.inv.vcl.erb and the one entry (X-CS2 == 520-18 part of file) ? 
[22:06:13] I don't see how 520-18 is special [22:07:04] dr0ptp4kt: stuff for opera, contingent? [22:07:16] I really don't understand the question at all :) [22:08:24] it is as the code says it is there, I guess [22:08:29] paravoid, 520-18 says to only tag the traffic if the ua is opera. bblack, i was just wondering it there's any munging of stuff for opera-sourced traffic coming from netmapper. if not, that's cool…may need to make a feature enhancement request. [22:09:19] I don't see that [22:09:22] bblack, the other thing i was wondering was if you would be able to do a google hangout and walk me a bit through netmapper. i've sort of been tuning out on discussions on that stuff because i knew you and yurik had it covered. but now that it's stable it would be good to learn. [22:09:24] what do you mean by munging, and why would netmapper know anything about opera specifically, aside from the -OPERA data being sent by the zero metadata? [22:09:45] paravoid, i meant *520-16* says that, not 502-18 [22:10:02] oh [22:10:08] no idea what's up with that [22:10:20] I really want all this carrier logic gone from there :) [22:10:22] paravoid, i guess they had a deal whereby only opera traffic is supposed to be in-scope for w0. that's not part of future arrangements, though. [22:10:48] dr0ptp4kt: those conditions (e.g. what's going on in 520-16) come directly from your team. We're under the impression they go away completely at the varnish layer at some point once you've upgraded the app layer to handle that mess (all of those if-conditionals) [22:10:56] hey, i'm going to be moving some european traffic around in about an hour [22:11:32] dr0ptp4kt: and User-Agent header checking is very different than Opera Mini proxy detection by IP. They don't mean the same thing at all. [22:11:49] paravoid, bblack, yeah if you guys think letting the origin handle it (which is *relatively* computationally cheap) is okay, i think we should explore it further. 
[22:12:07] the origin handles it anyway, doesn't it [22:12:29] bblack, i agree, the inherited ua checking logic should be revisited. paravoid and bblack, as you probably wagered, i'm really interested in opera related stuff right now. [22:12:46] how come? [22:13:00] paravoid, yeah, it's the invocation of the ZeroRatedMobileAccess extension to which i refer. that is all in all pretty fast. [22:13:50] it's not so much about what's fast, it's about what belongs in the cache layer in varnish and what doesn't [22:14:02] paravoid, bblack, trying to ensure that the configurations in the opera infra map up to the vcl. [22:14:13] I don't understand [22:14:22] also, no reason to say our nicknames all the time :) [22:14:27] if we need it in varnish because it affects caching, that's ok. or the XFF/client-ip stuff is special, the cache needs to do the X-CS matching or whatever [22:15:13] but if there's no good reason a peice of functionality *has* to happen at the varnish layer to be correct, then it shouldn't be there [22:15:44] bblack, i hear you. [22:15:49] (03PS1) 10Bsitu: Enable Echo and Thanks on Various wikis: [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/88246 [22:16:46] (03PS2) 10Bsitu: Enable Echo and Thanks on Various wikis: [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/88246 [22:18:56] paravoid, essentially, opera's ops people can have traffic for particular url patterns go through different load balancer slots if arranged with carriers. and when they do go through slots other than the default (default = slot 0), it's easier to tell that zero-rating must have been arranged between the carrier and opera. so i'm trying to optimize the configurations. [22:19:29] uhm, okay [22:19:33] "slots" == "sets of loadbalancers" ? [22:19:43] I don't think we should care about all that at all [22:19:47] but in any case, it doesn't seem like it makes anything easier to even be aware of that. 
[22:19:47] but feel free [22:20:17] :) [22:20:34] you'd still need explicit matching for the carrier's networks for all the zero-rated browsers that aren't opera-mini, so there's no point looking at specific opera LBs [22:20:50] (although we'd still unwrap an XFF entry from opera in that general case) [22:21:20] on that note, we were looking the other day at a request coming from a Nokia network with XFF [22:21:35] so it might be worth it to do similar matching with those [22:22:02] I'd go as far as to say that we should think about moving opera/nokia matching outside of zero and use it to trust their XFF [22:22:16] but let's call this a long term goal [22:22:36] paravoid: do we care other than in the zero case, though? Just to get anon user IPs correct I guess? [22:22:39] bblack, yeah, in order to know that the ip should have an x-cs value mapped to it, knowing the slot, in addition to the approach taken by the carrier (zero-rate slot 0, or zero-rate only non-slot-0), as well as examining yet another header sent by opera mini's proxies are things we need to know. [22:22:48] bblack: yep, that was the idea [22:23:29] oh come on [22:23:45] who makes this deals with carriers [22:23:49] agree that other major proxying services are noteworthy, i mentioned it to the business team. [22:23:58] ^ the business team does [22:24:00] dr0ptp4kt: are you saying sometimes the zero agreement *only* covers traffic that both came from Carrier-Foo *and* was Opera-Mini? (as in, they won't zero-rate other browsers on their network?) [22:24:27] that's already the case, as he pointed out with 502-16 [22:24:30] 520-16 even [22:24:39] bblack, there's only one carrier where the configuration explicitly only is supposed to zero-rate opera. [22:24:56] Yeah but 520-16 doesn't even do that, it just checks User-Agent, which is not telling you anything about Opera -vs- OperaMini, and is easily faked... 
[22:25:18] oh, true [22:25:23] wth, this is just crazy [22:25:38] ok, I have two things [22:25:46] yeah, that's something to be cleaned up [22:26:07] paravoid, bblack you guys want to jump on a google hangout? may be quicker! [22:26:27] a) we need to work towards having a Varnish config that's basically "set req.http.X-CS2 = netmapper.map("zero", "" + client.ip);" + opera mini/nokia/trusted proxies untangling [22:26:43] I would rather we didn't, and the requirements coming down to ops from heaven just made logical sense when they arrived :) [22:26:58] LOL [22:27:10] b) please don't sign contracts that say "if the carrier ip is X and the XFF has that Y and header H is Z" [22:28:10] Nobody tell them that the contract could specify that zero-rated mobile access should only be enabled if a parity calculation on the returned content is odd, but not even. [22:28:28] haha :) [22:29:06] on (a) sounds like we'll need to work together to figure out options without making the object cache balloon or killing the backend. on (b) fortunately the contract doesn't often involve parity bits :) [22:29:44] ? [22:29:46] the good thing is we can work with partners to start whitelisting everything in wikimedia-land. 
but we can't necessarily ask people to whitelist *all* opera mini traffic [22:29:57] the object cache is fine [22:29:58] dr0ptp4kt: but seriously: at the varnish layer, we should be a) unwrapped trusted XFF's based on things like the whole opera-mini list, and then b) setting a single header that indicates "the client was in the IP range for Carrier-Foo" [22:30:15] everything else doesn't seem to belong at this layer (or in a contract, but that's a whole other matter) [22:30:29] we'll just tag with X-CS with the carrier ip, zero can set the Vary: X-CS if it's a zero page or not set it if it's not [22:30:35] PROBLEM - Disk space on ms-be1003 is CRITICAL: DISK CRITICAL - /srv/swift-storage/sdd1 is not accessible: Input/output error [22:30:35] PROBLEM - RAID on ms-be1003 is CRITICAL: CRITICAL: 1 failed logical drive(s) (Offline) [22:30:49] blergh [22:32:26] I guess the new raid check worked though [22:32:28] bblack, paravoid, the cache needs to be varied by the X-CS, though. variance and the corresponding object lookup is dependent on the inbound headers. [22:32:39] that's what I said isn't it? [22:35:08] e.g. this in zero.inc.vcl: [...] if (req.http.X-CS2 == "456-02") { if (req.http.X-Subdomain == "ZERO") { set req.http.X-CS = "456-02"; } } [...] [22:36:01] at the zero app layer, you could just do if X-CS2 == 456-02 && X-Sub == "Zero", then set the banner and set Vary: X-CS, otherwise don't send that stuff [22:36:31] well, except we'd just call it all X-CS and not use X-CS2 [22:36:45] nod [22:38:34] RECOVERY - Disk space on ms-be1003 is OK: DISK OK [22:38:45] yeah, we may be onto something there, although i suppose i'll need to write down a plan. our goal is to get all wikimedia projects zero-rated for both opera proxy-sourced traffic and directly-sourced traffic. and for other major proxy sources, i hope, too. 
[22:39:03] erm [22:39:12] we've been saying this for more than 6 months :) [22:39:16] that was the plan with netmapper all along [22:39:56] paravoid, yep, just reiterating the commitment to it. [22:40:54] we need dfltlang first for this though [22:40:59] I filed a bugzilla for this :) [22:41:08] paravoid, cool [22:42:57] that reminds me, i'm going to ping on the wikimedia ip addresses question the one with subject "request for confirmation on lbs". any help there is most appreciated. [22:43:36] AaronSchulz: hey [22:43:46] AaronSchulz: commons originals is done :) [22:43:55] it needs a rerun obviously [22:44:03] \o/ [22:44:11] en & de too [22:44:14] I need to do the rest [22:44:26] plus transcoded, deleted, temp [22:44:32] render [22:44:35] and last but not least, thumbs [22:44:58] can you/want to start doing multiwrite for originals soonish? [22:45:36] after a syncFileBackend run I presume :) [22:47:17] sure [22:49:36] paravoid: what timestamp did you start swiftrepl? [22:50:43] approx 2013-10-03T23:06:00+00:00 [22:50:58] it's having troubles with a file under .99 now [22:51:16] that sounds vaguely familiar, heh [22:52:35] transferred 582775 out of 590952 for 9/99/Хозяйственная_постройка_Крестьянского_Поземельного_банка.jpg [22:52:37] paravoid: can you do deleted next? [22:52:50] heh, container listing & filesize disagree [22:53:16] AaronSchulz: not non-commons originals before that? 
[22:53:38] ah, right, you still have to do those [22:53:58] as long as it's originals, deleted, [22:56:13] okay [22:56:18] I reuploaded that one btw [22:56:30] and it got fixed; I have to readd the sha1 b36 header though [22:56:35] Last Modified: Mon, 16 Sep 2013 05:58:10 GMT [22:56:40] so it was recent too [22:56:45] otoh, it was one file over all commons [22:58:43] Destination does not have 9/99/-().jpg, syncing [22:58:43] Destination does not have 9/99/.jpg, syncing [22:58:44] Destination does not have 9/99/.jpg, syncing [22:59:18] PROBLEM - Puppetmaster HTTPS on virt0 is CRITICAL: HTTP CRITICAL - Invalid HTTP response received from host on port 8140: HTTP/1.1 500 Internal Server Error [23:02:41] paravoid: is that just terminal output wonkiness? [23:02:55] hopefully... [23:03:15] hey, look at this: https://commons.wikimedia.org/wiki/File:%D0%A5%D0%BE%D0%B7%D1%8F%D0%B9%D1%81%D1%82%D0%B2%D0%B5%D0%BD%D0%BD%D0%B0%D1%8F_%D0%BF%D0%BE%D1%81%D1%82%D1%80%D0%BE%D0%B9%D0%BA%D0%B0_%D0%9A%D1%80%D0%B5%D1%81%D1%82%D1%8C%D1%8F%D0%BD%D1%81%D0%BA%D0%BE%D0%B3%D0%BE_%D0%9F%D0%BE%D0%B7%D0%B5%D0%BC%D0%B5%D0%BB%D1%8C%D0%BD%D0%BE%D0%B3%D0%BE_%D0%B1%D0%B0%D0%BD%D0%BA%D0%B0.jpg [23:03:24] what's with the second thumbnail href? [23:03:42] Hm.. I'm looking to write a script that iterates over all wikis, but it needs to work outside of fenari/tin (e.g. inside a periodic jenkins job). On tin I'd loop over all.dblist and do something like `php eval.php --wiki $wikiid ; echo $wgCanonicalServer` [23:04:10] Meh, I suppose we declare wgCanonicalServer for each wiki, so I can just jank it out of InitialiseSettings [23:04:10] Hope is that thing with feathers / That perches in the soul / And sings the tune without the words / And never stops at all [23:05:11] paravoid: which one? 
[23:05:21] ohh [23:05:27] I was looking at the list of sizes ;) [23:05:47] that was caused by a race bug [23:06:01] it's been there for ages...t'was fixed last week [23:07:54] oh heh [23:08:55] Extension:BugRacing [23:10:08] PROBLEM - Puppet freshness on williams is CRITICAL: No successful Puppet run in the last 10 hours [23:20:37] PROBLEM - DPKG on stafford is CRITICAL: DPKG CRITICAL dpkg reports broken packages [23:20:37] PROBLEM - mysqld processes on db1023 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld [23:20:38] PROBLEM - Disk space on cp1047 is CRITICAL: DISK CRITICAL - free space: /srv/sda3 7314 MB (2% inode=99%): /srv/sdb3 6810 MB (2% inode=99%): [23:20:38] PROBLEM - HTTPS on amssq47 is CRITICAL: Connection refused [23:20:47] PROBLEM - check_job_queue on hume is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:22:40] i'm about to be moving around european traffic [23:22:46] !log moving european traffic around [23:22:48] moving what to where? :) [23:22:55] just curious [23:23:03] Logged the message, Mistress of the network gear. [23:23:46] oh amsix to esams [23:23:56] but that means everything to transit first [23:24:01] then bringing peering back up [23:24:38] traffic is draining now [23:24:46] !log traffic draining from ams-ix on cr2-knams [23:24:58] Logged the message, Mistress of the network gear. 
[23:26:24] paravoid, two uploads at the same time [23:26:40] old issue, although it was mentioned in a bug report recently [23:33:40] (03CR) 10TTO: [C: 04-1] "needs rebasing after I4429fb327159b6e71a30c6536ccab0fcf60e6f66" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/86418 (owner: 10TTO) [23:35:29] (03CR) 10Bsitu: [C: 04-2] Enable Echo and Thanks on Various wikis: [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/88246 (owner: 10Bsitu) [23:36:07] RECOVERY - Puppet freshness on williams is OK: puppet ran at Mon Oct 7 23:35:59 UTC 2013 [23:36:09] !log turning up new ams-ix port on cr2-esams [23:36:21] Logged the message, Mistress of the network gear. [23:36:28] PROBLEM - Puppet freshness on williams is CRITICAL: No successful Puppet run in the last 10 hours [23:50:39] (03CR) 10Mattflaschen: "Yeah, this is easy to revert if it causes unexpected consequences (like I said at the bug, not tested due to a lack of a suitable environm" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/87045 (owner: 10Mattflaschen) [23:53:50] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 30 seconds [23:53:51] AaronSchulz: did you see how we have swift in graphite now? [23:53:58] via statsd [23:54:01] pmtpa too [23:54:20] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.122 second response time