[00:13:34] Operations, Parsing-Team, Services (later): Node 6 upgrade planning - https://phabricator.wikimedia.org/T149331#2758922 (GWicke)
[00:14:26] PROBLEM - High lag on wdqs1001 is CRITICAL: CRITICAL: 48.28% of data above the critical threshold [1800.0]
[00:15:26] RECOVERY - High lag on wdqs1001 is OK: OK: Less than 30.00% above the threshold [600.0]
[00:30:02] thcipriani|afk: thanks! :D
[00:32:47] Operations, Parsing-Team, Services (later): Node 6 upgrade planning - https://phabricator.wikimedia.org/T149331#2758953 (GWicke) All but cxserver now have node 6 test patches pending. @KartikMistry @santhosh, could you test cxserver with Node 6? With https://github.com/creationix/nvm, you can do this...
[00:38:57] !grrrit-wm-die
[00:39:21] Operations, Parsing-Team, Services (later): Node 6 upgrade planning - https://phabricator.wikimedia.org/T149331#2758956 (Arlolra)
[00:42:40] paladox: please stop testing the live production instance of grrrit, it's a important part of our ecosystem
[00:45:53] Sorry
[00:46:57] p858snake|L2 ^^ i didnt see your message until now since my irc keeps dropping out
[00:50:16] p858snake|L2: we dont know to get a new instance
[00:51:19] Zppix: then stop playing around with production services and ask for help with that, we have many IRC channels that can provide assistance if you ask and we also have a fantastic installation of phabricator you can utilise
[00:52:31] p858snake|L2: i kindly ask you to please not talk down to me we are not 'playing around' with anything we are attempting to make improvements just as you are.
[00:54:35] You don't seem to know what you're doing
[00:54:50] remember how grrrit-wm always stops talking on every Gerrit restart and everybody is annoyed by thtat, that's what they are fixing
[00:55:49] they asked for an instance to test ..
on a ticket
[00:56:06] But we didnt get it we got a vague create it yourself
[00:56:16] And we dont have that access
[00:57:35] so if you want to chime in on https://phabricator.wikimedia.org/T149529
[01:01:59] (PS1) Smalyshev: Limit concurrent connections by client IP [puppet] - https://gerrit.wikimedia.org/r/319010 (https://phabricator.wikimedia.org/T108488)
[01:13:44] Operations, Discovery, Wikidata, Wikidata-Query-Service, hardware-requests: Estimate hardware requirements for WDQS upgrade - https://phabricator.wikimedia.org/T148747#2758973 (OakleyAlways1) stalled>declined Stop
[01:18:02] Spammer ^
[01:18:03] Operations, Discovery, Wikidata, Wikidata-Query-Service, hardware-requests: Estimate hardware requirements for WDQS upgrade - https://phabricator.wikimedia.org/T148747#2758975 (Smalyshev) declined>stalled
[01:18:56] Operations, Discovery, Wikidata, Wikidata-Query-Service, hardware-requests: Estimate hardware requirements for WDQS upgrade - https://phabricator.wikimedia.org/T148747#2758976 (OakleyAlways1) stalled>Invalid Really stop hacking me
[01:18:58] is there somebody who can put some restrains on that OakleyAlways1 ?
[01:19:42] greg-g ^
[01:21:16] PROBLEM - puppet last run on elastic1035 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[01:21:36] Operations, Discovery, Wikidata, Wikidata-Query-Service, hardware-requests: Estimate hardware requirements for WDQS upgrade - https://phabricator.wikimedia.org/T148747#2758978 (Smalyshev) Invalid>stalled
[01:32:19] they also messed with https://phabricator.wikimedia.org/T146183
[01:40:57] mutante, around?
[01:41:07] can you disable that account please?
[01:43:39] i'm not a phabricator admin if that's what you mean
[01:43:49] You have access to pwstore though right?
[01:46:45] mutante, _joe_, akosiaris
[01:48:46] RECOVERY - puppet last run on elastic1035 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures
[01:53:28] fwiw, i dont see a password for that in in pwstore
[02:05:18] PROBLEM - Last backup of the tools filesystem on labstore1001 is CRITICAL: CRITICAL - Last run result for unit replicate-tools was exit-code
[02:18:08] ^ looking
[02:21:55] ah of course, labstore2001 is down
[02:26:18] !log l10nupdate@tin scap sync-l10n completed (1.28.0-wmf.23) (duration: 09m 08s)
[02:26:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:30:34] !log l10nupdate@tin ResourceLoader cache refresh completed at Tue Nov 1 02:30:34 UTC 2016 (duration 4m 16s)
[02:30:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:30:52] I disabled the account
[02:32:18] Krenair: you could request Phabricator admin rights -- you have a lot of experience fighting abuse
[02:34:54] I'm a wikimedia volunteer, there are a *lot* of people around here with that kind of experience
[02:35:49] ori, do you happen to know where the credentials to the @admin account live?
[02:49:48] Krenair: chad is the one that installed, so he probably has them
[02:50:07] we could probably create a new OpsAdmin (or whatever) and add them to pwstore
[02:54:26] PROBLEM - puppet last run on relforge1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[03:05:07] PROBLEM - Last backup of the others filesystem on labstore1001 is CRITICAL: CRITICAL - Last run result for unit replicate-others was exit-code
[03:07:23] p858snake|L2_, my understanding was that ops have the credentials to @admin somewhere
[03:07:37] it might only be certain ops...
I assumed it was in pwstore but apparently not
[03:23:16] RECOVERY - puppet last run on relforge1001 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures
[03:23:59] (CR) BBlack: "This will "work", but only in a loose sense. The varnish layers will randomize the traffic into wdqs instances. So, for example, if you " [puppet] - https://gerrit.wikimedia.org/r/319010 (https://phabricator.wikimedia.org/T108488) (owner: Smalyshev)
[03:25:49] (CR) Smalyshev: "That's fine with me, I don't care if they have a lot of connections as long as they don't overload any of the servers. So I don't want the" [puppet] - https://gerrit.wikimedia.org/r/319010 (https://phabricator.wikimedia.org/T108488) (owner: Smalyshev)
[03:30:16] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 925.68 seconds
[03:41:46] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 116.18 seconds
[03:51:31] (PS1) Madhuvishy: nfs: Move labstore secondary cluster hiera config to eqiad.yaml [puppet] - https://gerrit.wikimedia.org/r/319016
[03:53:05] (CR) Madhuvishy: [C: 2] nfs: Move labstore secondary cluster hiera config to eqiad.yaml [puppet] - https://gerrit.wikimedia.org/r/319016 (owner: Madhuvishy)
[04:05:26] PROBLEM - Last backup of the maps filesystem on labstore1001 is CRITICAL: CRITICAL - Last run result for unit replicate-maps was exit-code
[05:08:32] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:09:32] RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy
[05:21:37] Krenair: we don't use a shared admin account, we just have users with admin privs: https://phabricator.wikimedia.org/people/query/Cl4VsSngEUn7/#R
[05:22:12] greg-g, this list shows a shared admin user though
[05:22:20] for use by people not listed there
[05:22:45] see last time it did anything, it's not generally in use at all, there's no need
[05:22:56] (afaik, there might be something, but it's not frequent)
[05:23:05] then we should kill it
[05:23:19] * greg-g shrugs
[05:23:24] it might have a use, see my last line :)
[05:23:36] who has access to it?
[05:23:47] I'd have to ask mukunda
[05:24:24] anyways, this is a tangent, the point is, there are admins, that query finds them, they have power to disable users
[05:24:36] I know all of this
[05:25:15] k :)
[05:25:19] ok
[05:26:33] the description of the admin user should probably be updated at least, though, it seems Ops wouldn't know how to use it accidentally
[05:27:01] indeed at least two ops don't know where the password for it is
[05:45:30] PROBLEM - puppet last run on analytics1057 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[06:14:20] RECOVERY - puppet last run on analytics1057 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures
[06:26:20] PROBLEM - puppet last run on elastic1021 is CRITICAL: CRITICAL: Catalog fetch fail.
Either compilation failed or puppetmaster has issues
[06:54:10] RECOVERY - puppet last run on elastic1021 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures
[07:07:25] (PS1) Yurik: LABS: set isLocal=false for tabular data [mediawiki-config] - https://gerrit.wikimedia.org/r/319024
[07:08:02] (CR) Yurik: [C: 2] LABS: set isLocal=false for tabular data [mediawiki-config] - https://gerrit.wikimedia.org/r/319024 (owner: Yurik)
[07:08:30] (Merged) jenkins-bot: LABS: set isLocal=false for tabular data [mediawiki-config] - https://gerrit.wikimedia.org/r/319024 (owner: Yurik)
[07:15:19] (PS1) Madhuvishy: nfs: Fix drbd resource definition [puppet] - https://gerrit.wikimedia.org/r/319026
[07:16:15] (CR) jenkins-bot: [V: -1] nfs: Fix drbd resource definition [puppet] - https://gerrit.wikimedia.org/r/319026 (owner: Madhuvishy)
[07:17:17] Operations, Parsing-Team, Services (later): Node 6 upgrade planning - https://phabricator.wikimedia.org/T149331#2759231 (KartikMistry) Thanks @GWicke I'll test cxserver with Node 6 and update here.
[07:29:13] (PS2) Madhuvishy: nfs: Fix drbd resource definition [puppet] - https://gerrit.wikimedia.org/r/319026
[07:30:32] (CR) Madhuvishy: [C: 2] nfs: Fix drbd resource definition [puppet] - https://gerrit.wikimedia.org/r/319026 (owner: Madhuvishy)
[07:40:31] (PS1) Arseny1992: Enable OATHAuth on all private wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/319035 (https://phabricator.wikimedia.org/T149614)
[07:41:38] Reedy ^ ;)
[07:46:00] Operations, Parsoid: Deploy failed on wtp2017.codfw.wmnet - https://phabricator.wikimedia.org/T149115#2759255 (mobrovac)
[07:47:05] (PS1) Yurik: LABS: Set $wgJsonConfigInterwikiPrefix correctly [mediawiki-config] - https://gerrit.wikimedia.org/r/319036
[07:47:53] Hi, what about creating new right for T149610 ? It's about hiding certains gadgets.
The community knows that the abuser can copy it to common.js or somewhere and run it this way or create an if which doesn't allow running without the right. Is it possible from your side?
[07:47:54] T149610: New user right and user group for et.wikipedia.org - https://phabricator.wikimedia.org/T149610
[07:48:58] (CR) Yurik: [C: 2] LABS: Set $wgJsonConfigInterwikiPrefix correctly [mediawiki-config] - https://gerrit.wikimedia.org/r/319036 (owner: Yurik)
[07:49:07] Urbanecm see my reply on the task. This needs to have some policy in my opinion
[07:49:22] I saw it and I replied.
[07:49:25] (Merged) jenkins-bot: LABS: Set $wgJsonConfigInterwikiPrefix correctly [mediawiki-config] - https://gerrit.wikimedia.org/r/319036 (owner: Yurik)
[07:49:49] But what policy do you mean? Our policy? Global policy? Local policy?
[07:50:24] (PS4) Muehlenhoff: Also provide imagemagick wrapper in openstack::nova::manager [puppet] - https://gerrit.wikimedia.org/r/316545 (https://phabricator.wikimedia.org/T145811)
[07:51:00] Operations, Parsoid, User-mobrovac: Deploy failed on wtp2017.codfw.wmnet - https://phabricator.wikimedia.org/T149115#2759257 (mobrovac) a:mobrovac The problem here are the depooling / repooling scripts used during the deploy. As part of {T145518} we have developed more robust scripts, but these c...
[07:51:11] Operations, Parsoid, User-mobrovac: Deploy failed on wtp2017.codfw.wmnet - https://phabricator.wikimedia.org/T149115#2759262 (mobrovac) p:Triage>High
[07:55:26] Urbanecm : Global policy about whether wikis are allowed to request such rights changes that cause rights delegation on clases of editors who may be or may not be tech-savvy on the context a right grants to do
[07:56:14] This partly discussed in the foundation-l thread I also linked
[08:00:13] There is no point of hiding gadgets if they can be accessed sideways such as by copying to own js
[08:00:20] (PS1) Madhuvishy: nfs-manage: Fix space trimming in template [puppet] - https://gerrit.wikimedia.org/r/319041
[08:04:13] The right/flag/grants "explanation" was mainly for Cumbril
[08:16:10] RECOVERY - Host labstore2001 is UP: PING OK - Packet loss = 0%, RTA = 37.99 ms
[08:16:30] RECOVERY - SSH on labstore2001 is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u2 (protocol 2.0)
[08:16:39] madhuvishy: ^
[08:16:51] RECOVERY - DPKG on labstore2001 is OK: All packages OK
[08:16:51] RECOVERY - configured eth on labstore2001 is OK: OK - interfaces up
[08:16:51] RECOVERY - MD RAID on labstore2001 is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0
[08:16:52] RECOVERY - dhclient process on labstore2001 is OK: PROCS OK: 0 processes with command name dhclient
[08:17:02] oh wth
[08:17:20] yuvipanda: ha ha
[08:17:29] madhuvishy: did you not start?
[08:17:39] yuvipanda: i did nothing
[08:18:47] madhuvishy: haloween spookiness
[08:19:44] yuvipanda: :|
[08:20:04] it's like we prodded the giant and it decided to wake up
[08:20:14] madhuvishy: yeah
[08:20:20] my suspicion is that your console com2 did it
[08:21:22] may be
[08:22:19] Operations, ops-codfw: labstore2001 doesn't boot - https://phabricator.wikimedia.org/T149567#2759297 (madhuvishy) I tried now - madhuvishy@puppetmaster1001:~$ ssh root@labstore2001.mgmt.codfw.wmnet root@labstore2001.mgmt.codfw.wmnet's password: /admin1-> console com2 console: Serial Device 2 is current...
[08:22:23] madhuvishy: but yeah, it seems to be up
[08:22:31] yuvipanda: yup, ssh and all
[08:23:00] * yuvipanda nods
[08:23:39] (CR) Madhuvishy: [C: 2] nfs-manage: Fix space trimming in template [puppet] - https://gerrit.wikimedia.org/r/319041 (owner: Madhuvishy)
[08:23:52] madhuvishy: I fixed it up, current commenting on the bug
[08:24:04] moritzm: oh thank god
[08:24:27] thank you :D
[08:24:41] yuvipanda: ^
[08:26:57] Operations, ops-codfw: labstore2001 doesn't boot - https://phabricator.wikimedia.org/T149567#2759299 (Peachey88)
[08:27:06] Operations, ops-codfw: labstore2001 doesn't boot - https://phabricator.wikimedia.org/T149567#2759300 (MoritzMuehlenhoff) I logged in over the mgmt and the network interface was down, there's no clue that happened, though. I started the networked services manually and removed /run/nologin (which prevented...
[08:28:03] the host had no network connectivity, so logins were only possible via the locally stored root password
[08:28:39] moritzm: ah, right - that's what i was going to try when it said console busy
[08:28:56] and then everything came up, and i was extremely confused :)
[08:31:20] BTW, Papaul mentioned a broken disk, this will probably flag with the next Icinga run
[08:32:39] moritzm: right
[08:40:35] Operations, ops-codfw: Predictive disk failure on db2047 - https://phabricator.wikimedia.org/T149670#2759306 (MoritzMuehlenhoff)
[08:53:03] arseny92: Sorry for my lateness, I didn't noticed your replay. I know gadgets could be easily copied (and if needed the condition which disallows running without correct right could be removed) but how many users can do it.
[08:54:32] And abuse filters can be bypassed too by some ways, e.g. guessing the conditions if filter is hidden or creating edit which doesn't hit any filter but we do create them. Because they make vandalism harder.
[08:54:40] RECOVERY - salt-minion processes on phab2001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[08:54:56] Operations, ops-eqiad, media-storage: ms-be1001 - disk failure /dev/sdf1 - https://phabricator.wikimedia.org/T149073#2759324 (MoritzMuehlenhoff) a:Cmjohnson
[08:56:06] Operations, ops-eqiad, media-storage: ms-be1005 - MegaRAID - CRITICAL: 1 failed LD(s) (Offline) - https://phabricator.wikimedia.org/T149069#2759325 (MoritzMuehlenhoff) a:Cmjohnson
[08:56:21] Operations, ops-codfw: RAID degraded on ms-be2011 - https://phabricator.wikimedia.org/T149234#2759326 (MoritzMuehlenhoff) a:Papaul
[08:58:28] Operations, DBA: Review Icinga alarms with disabled notifications - https://phabricator.wikimedia.org/T149643#2758740 (MoritzMuehlenhoff) @Volans: Can you send a mail to the ops@ list? Otherwise people will miss the task.
[09:03:55] Operations, ops-codfw, DBA: Degraded RAID on db2052 - https://phabricator.wikimedia.org/T149377#2759331 (MoritzMuehlenhoff) a:Papaul
[09:05:57] (PS2) Alexandros Kosiaris: icinga: Add comments about paging infrastructure update [puppet] - https://gerrit.wikimedia.org/r/318955
[09:06:36] (PS3) Alexandros Kosiaris: icinga: Add comments about paging infrastructure update [puppet] - https://gerrit.wikimedia.org/r/318955
[09:06:40] (CR) Alexandros Kosiaris: [C: 2 V: 2] icinga: Add comments about paging infrastructure update [puppet] - https://gerrit.wikimedia.org/r/318955 (owner: Alexandros Kosiaris)
[09:07:02] (PS4) Alexandros Kosiaris: tendril: Supply a robots.txt disallow all robots [puppet] - https://gerrit.wikimedia.org/r/318900 (https://phabricator.wikimedia.org/T149340)
[09:07:09] (CR) Alexandros Kosiaris: [C: 2 V: 2] tendril: Supply a robots.txt disallow all robots [puppet] - https://gerrit.wikimedia.org/r/318900 (https://phabricator.wikimedia.org/T149340) (owner: Alexandros Kosiaris)
[09:17:14] (PS1) Gehel: Revert "cirrus - disable the rebuild of completion indices" [puppet] - https://gerrit.wikimedia.org/r/319043
[09:18:53] (CR) Gehel: [C: 2] Revert "cirrus - disable the rebuild of completion indices" [puppet] - https://gerrit.wikimedia.org/r/319043 (owner: Gehel)
[09:21:39] (PS2) Gehel: elasticsearch - enable GC logs by default [puppet] - https://gerrit.wikimedia.org/r/318353 (https://phabricator.wikimedia.org/T134853)
[09:22:44] (PS7) Alexandros Kosiaris: icinga: Specify mode for nagios_host, nagios_service [puppet] - https://gerrit.wikimedia.org/r/317791
[09:22:46] (PS8) Alexandros Kosiaris: icinga: Increase max_concurrent_checks [puppet] - https://gerrit.wikimedia.org/r/317763
[09:22:48] (PS1) Alexandros Kosiaris: icinga: Always display all results in web interface [puppet] - https://gerrit.wikimedia.org/r/319045
[09:22:50] (CR) Gehel: [C: 2]
elasticsearch - enable GC logs by default [puppet] - https://gerrit.wikimedia.org/r/318353 (https://phabricator.wikimedia.org/T134853) (owner: Gehel)
[09:33:26] Operations: Reimage/rename codfw pool counters - https://phabricator.wikimedia.org/T149298#2759361 (MoritzMuehlenhoff) p:Triage>Normal
[09:33:48] Operations, Discovery-Search (Current work): Followup on elastic1026 blowing up May 9, 21:43-22:14 UTC - https://phabricator.wikimedia.org/T134829#2759363 (Gehel)
[09:33:50] Operations, Discovery-Search (Current work), Patch-For-Review, Wikimedia-Incident: Enable GC (garbage collection) logs on Elasticsearch JVM - https://phabricator.wikimedia.org/T134853#2759362 (Gehel) Open>Resolved
[09:36:21] Operations, Prod-Kubernetes, Kubernetes-production-experiment, User-Joe: Install a docker registry for production - https://phabricator.wikimedia.org/T148960#2759376 (MoritzMuehlenhoff) p:Triage>High
[09:37:44] (PS8) Alexandros Kosiaris: icinga: Specify mode for nagios_host, nagios_service [puppet] - https://gerrit.wikimedia.org/r/317791
[09:37:46] (PS2) Alexandros Kosiaris: icinga: Always display all results in web interface [puppet] - https://gerrit.wikimedia.org/r/319045
[09:37:48] (PS9) Alexandros Kosiaris: icinga: Increase max_concurrent_checks [puppet] - https://gerrit.wikimedia.org/r/317763
[09:37:50] (PS1) Alexandros Kosiaris: icinga: switch tegmen and einsteinium roles [puppet] - https://gerrit.wikimedia.org/r/319047
[09:39:42] (CR) Alexandros Kosiaris: [C: 2] icinga: Always display all results in web interface [puppet] - https://gerrit.wikimedia.org/r/319045 (owner: Alexandros Kosiaris)
[09:47:16] Operations, Parsing-Team, Services (later): Node 6 upgrade planning - https://phabricator.wikimedia.org/T149331#2759379 (mobrovac)
[09:48:54] Operations, Parsing-Team, Services (later), User-mobrovac: Node 6 upgrade planning - https://phabricator.wikimedia.org/T149331#2748980
(mobrovac)
[09:49:18] (PS1) Gehel: elasticsearch: /etc/elasticsearch/scripts is not used anymore [puppet] - https://gerrit.wikimedia.org/r/319048
[09:53:31] (CR) Alexandros Kosiaris: [C: 2] icinga: switch tegmen and einsteinium roles [puppet] - https://gerrit.wikimedia.org/r/319047 (owner: Alexandros Kosiaris)
[09:57:31] Operations, Analytics, ChangeProp, Citoid, and 6 others: Node 6 upgrade planning - https://phabricator.wikimedia.org/T149331#2759397 (mobrovac)
[10:02:55] (PS2) Ema: cache_text varnishtest: beacon and CP [puppet] - https://gerrit.wikimedia.org/r/318946 (https://phabricator.wikimedia.org/T131503)
[10:03:03] (CR) Ema: [C: 2 V: 2] cache_text varnishtest: beacon and CP [puppet] - https://gerrit.wikimedia.org/r/318946 (https://phabricator.wikimedia.org/T131503) (owner: Ema)
[10:06:01] !log installing openjdk security fixes on restbase2, rolling restart of cassandra
[10:06:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:18:50] Operations, Gerrit, grrrit-wm, Patch-For-Review: Support restarting grrrit-wm automatically when we restart production gerrit - https://phabricator.wikimedia.org/T149609#2759431 (Peachey88)
[10:20:35] PROBLEM - cassandra-b SSL 10.192.16.163:7001 on restbase2001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused
[10:21:25] RECOVERY - puppet last run on tegmen is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures
[10:21:35] RECOVERY - cassandra-b SSL 10.192.16.163:7001 on restbase2001 is OK: SSL OK - Certificate restbase2001-b valid until 2017-09-12 15:13:28 +0000 (expires in 315 days)
[10:30:05] PROBLEM - cassandra-a CQL 10.192.16.165:9042 on restbase2002 is CRITICAL: connect to address 10.192.16.165 and port 9042: Connection refused
[10:31:34] RECOVERY - cassandra-a CQL 10.192.16.165:9042 on restbase2002 is OK: TCP OK - 0.000 second response time on 10.192.16.165 port 9042
[10:42:07]
RECOVERY - PyBal backends health check on lvs1008 is OK: PYBAL OK - All pools are healthy
[10:42:25] that's erroneous btw ^
[10:42:35] it's just cause tegmen is now the primary and not einsteinium
[10:42:51] lvs1007 and lvs1009 will report that as well
[10:43:45] (PS1) Alexandros Kosiaris: switch over einsteinium to tegmen [dns] - https://gerrit.wikimedia.org/r/319051
[10:44:04] (PS2) Alexandros Kosiaris: switch over einsteinium to tegmen [dns] - https://gerrit.wikimedia.org/r/319051
[10:44:08] (CR) Alexandros Kosiaris: [C: 2 V: 2] switch over einsteinium to tegmen [dns] - https://gerrit.wikimedia.org/r/319051 (owner: Alexandros Kosiaris)
[10:46:24] (CR) Ema: Replace check_sslxNN with check_ssl_unified (1 comment) [puppet] - https://gerrit.wikimedia.org/r/318971 (owner: BBlack)
[10:48:47] PROBLEM - cassandra-b SSL 10.192.32.135:7001 on restbase2003 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused
[10:49:47] RECOVERY - cassandra-b SSL 10.192.32.135:7001 on restbase2003 is OK: SSL OK - Certificate restbase2003-b valid until 2017-09-12 15:35:15 +0000 (expires in 315 days)
[10:54:34] !log rebooting einsteinium
[10:54:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:01:41] (PS9) BBlack: Replace check_sslxNN with check_ssl_unified [puppet] - https://gerrit.wikimedia.org/r/318971
[11:01:43] (PS1) BBlack: sort nagios command lists [puppet] - https://gerrit.wikimedia.org/r/319052
[11:02:12] (CR) BBlack: [C: 2 V: 2] sort nagios command lists [puppet] - https://gerrit.wikimedia.org/r/319052 (owner: BBlack)
[11:04:54] !log rolling restart of cassandra on maps-test* for jvm upgrade
[11:04:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:13:43] !log rolling restart of cassandra on maps2* for jvm upgrade
[11:13:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:18:43] PROBLEM - kartotherian
endpoints health on maps2001 is CRITICAL: /img/{src},{z},{lat},{lon},{w}x{h}@{scale}x.{format} (Small scaled map) is CRITICAL: Test Small scaled map returned the unexpected status 400 (expecting: 200): /{src}/{z}/{x}/{y}@{scale}x.{format} (default scaled tile) is CRITICAL: Test default scaled tile returned the unexpected status 400 (expecting: 200)
[11:19:10] PROBLEM - Kartotherian LVS codfw on kartotherian.svc.codfw.wmnet is CRITICAL: /{src}/{z}/{x}/{y}@{scale}x.{format} (default scaled tile) is CRITICAL: Test default scaled tile returned the unexpected status 400 (expecting: 200)
[11:19:43] RECOVERY - kartotherian endpoints health on maps2001 is OK: All endpoints are healthy
[11:19:53] karthoterian issue is probably me and cassandra restart, checking
[11:20:08] RECOVERY - Kartotherian LVS codfw on kartotherian.svc.codfw.wmnet is OK: All endpoints are healthy
[11:25:33] (PS1) Rush: labstore: nfs-manage patches [puppet] - https://gerrit.wikimedia.org/r/319053
[11:25:49] (PS2) Rush: labstore: nfs-manage patches [puppet] - https://gerrit.wikimedia.org/r/319053
[11:27:24] (CR) Rush: [C: 2] labstore: nfs-manage patches [puppet] - https://gerrit.wikimedia.org/r/319053 (owner: Rush)
[11:54:13] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 10.192.48.44 on port 6479
[11:55:13] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 3071987 keys, up 1 days 3 hours - replication_delay is 0
[12:01:09] !log upgrading cache_text nginx => 1.11.4-1+wmf13
[12:01:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:01:40] (PS1) Jdrewniak: Bumping portal to master [mediawiki-config] - https://gerrit.wikimedia.org/r/319054 (https://phabricator.wikimedia.org/T128546)
[12:05:01] !log upgrading wtp1001 to nodejs 4.6
[12:05:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:10:25]
(CR) Reedy: [C: -1] Enable OATHAuth on all private wikis (2 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/319035 (https://phabricator.wikimedia.org/T149614) (owner: Arseny1992)
[12:17:13] (CR) Faidon Liambotis: [C: 2] Replace check_sslxNN with check_ssl_unified [puppet] - https://gerrit.wikimedia.org/r/318971 (owner: BBlack)
[12:17:33] (PS2) Faidon Liambotis: nagios: do both RSA/ECDSA checks in check_sslxNN [puppet] - https://gerrit.wikimedia.org/r/318949
[12:18:15] (CR) Faidon Liambotis: [C: 2] keyholder: be systemd compatible [puppet] - https://gerrit.wikimedia.org/r/318941 (https://phabricator.wikimedia.org/T148273) (owner: Volans)
[12:18:17] (CR) Faidon Liambotis: [C: 2] keyholder: fix flake8 [puppet] - https://gerrit.wikimedia.org/r/318942 (https://phabricator.wikimedia.org/T148273) (owner: Volans)
[12:20:44] (CR) Faidon Liambotis: [C: 2] keyholder: add support for SHA256 key fingerprints [puppet] - https://gerrit.wikimedia.org/r/318943 (https://phabricator.wikimedia.org/T148273) (owner: Volans)
[12:23:44] Operations, Analytics, ChangeProp, Citoid, and 6 others: Node 6 upgrade planning - https://phabricator.wikimedia.org/T149331#2759571 (faidon) >>! In T149331#2758280, @GWicke wrote: > So, here is a proposal: > > - Double-check Node 6 support for all production services. We have been testing major...
[12:27:40] Operations, Analytics, ChangeProp, Citoid, and 6 others: Node 6 upgrade planning - https://phabricator.wikimedia.org/T149331#2759582 (KartikMistry) @GWicke Manually testing using experimental packages is fine?
``` sudo apt -t experimental install nodejs ```
[12:28:17] Operations, Analytics, ChangeProp, Citoid, and 9 others: Node 6 upgrade planning - https://phabricator.wikimedia.org/T149331#2759583 (KartikMistry)
[12:29:58] Operations, Analytics, ChangeProp, Citoid, and 9 others: Node 6 upgrade planning - https://phabricator.wikimedia.org/T149331#2759586 (MoritzMuehlenhoff) @KartikMistry : Yeah, should be fine. That's (module some jessie backport changes) what we'll be using in production as well.
[12:31:08] (PS1) Rush: labstore: tc-setup new classes [puppet] - https://gerrit.wikimedia.org/r/319057
[12:31:35] (PS2) Rush: labstore: tc-setup new classes [puppet] - https://gerrit.wikimedia.org/r/319057
[12:32:47] !log mgmt powercycle of labstore1004
[12:32:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:39:44] (CR) Rush: [C: 2] labstore: tc-setup new classes [puppet] - https://gerrit.wikimedia.org/r/319057 (owner: Rush)
[12:44:43] (PS2) Arseny1992: Enable OATHAuth on all private wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/319035 (https://phabricator.wikimedia.org/T149614)
[12:47:04] !log rolling restart of cassandra on maps1* for jvm upgrade
[12:47:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:47:09] Operations: Cron spam caused ieee-data cron job - https://phabricator.wikimedia.org/T149681#2759599 (MoritzMuehlenhoff)
[12:47:33] Operations: Cron spam caused by ieee-data cron job - https://phabricator.wikimedia.org/T149681#2759599 (MoritzMuehlenhoff)
[12:53:11] Operations, Labs: cronspam from labstores, labcontrol, labstestservices - https://phabricator.wikimedia.org/T149574#2759635 (MoritzMuehlenhoff) The "Cron /usr/bin/rsync --delete --delete-after -aSO /srv/glance//images/ labcontrol1002.wikimedia.org:/srv/glance//images/" message...
[12:53:42] (PS1) Rush: labstore: 'other' is really misc-project [puppet] - https://gerrit.wikimedia.org/r/319061
[12:53:55] (PS2) Rush: labstore: 'other' is really misc-project [puppet] - https://gerrit.wikimedia.org/r/319061
[12:55:03] (PS10) BBlack: Replace check_sslxNN with check_ssl_unified [puppet] - https://gerrit.wikimedia.org/r/318971
[12:55:49] (CR) BBlack: [V: 2] Replace check_sslxNN with check_ssl_unified [puppet] - https://gerrit.wikimedia.org/r/318971 (owner: BBlack)
[13:00:04] addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Dear anthropoid, the time has come. Please deploy European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161101T1300).
[13:00:04] jan_drewniak: A patch you scheduled for European Mid-day SWAT(Max 8 patches) is about to be deployed. Please be available during the process.
[13:00:15] (PS1) Alexandros Kosiaris: icinga: Increase the check intervals [puppet] - https://gerrit.wikimedia.org/r/319062
[13:00:31] Hello
[13:00:52] Is Jan here?
[13:03:34] (CR) Rush: [C: 2] labstore: 'other' is really misc-project [puppet] - https://gerrit.wikimedia.org/r/319061 (owner: Rush)
[13:03:37] (PS3) Rush: labstore: 'other' is really misc-project [puppet] - https://gerrit.wikimedia.org/r/319061
[13:03:41] (CR) Rush: [V: 2] labstore: 'other' is really misc-project [puppet] - https://gerrit.wikimedia.org/r/319061 (owner: Rush)
[13:05:00] hi jan_drewniak
[13:05:12] jan_drewniak: so you've a patch for SWAT?
[13:05:26] Dereckson: hi there, was just about to ask, yes I did
[13:06:29] This week, DST is already observed in UE, but not yet in North America, so SWAT are "one hour sooner" from UE point of view.
[13:06:57] s/is already/is no longer
[13:09:18] Dereckson: wait so the SWAT ran 1 hour ago?
[13:09:38] no, no, it's right now
[13:10:05] I'm going to deploy your change.
[13:10:58] 06Operations, 10OTRS: Intermittent 503 errors on OTRS ticket system when sending responses to tickets - https://phabricator.wikimedia.org/T148299#2719516 (10akosiaris) @Josve05a, can't say I can reproduce. Is this still true ? [13:11:22] ok thanks :) re: DST I just look at the time on deployment page and figure the script gives me the correct time in my timezone [13:12:18] (03CR) 10Dereckson: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/319054 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [13:13:04] (03Merged) 10jenkins-bot: Bumping portal to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/319054 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [13:14:17] yurik merged but didn't deploy two labs change [13:14:26] b1d1ec4be9627bf790f445c803389abfccd46d26 LABS: set isLocal=false for tabular data [13:14:45] e631684b4b738b7fae673794bf65ab17ddf68dc8 LABS: Set $wgJsonConfigInterwikiPrefix correctly [13:14:46] (03PS1) 10BBlack: check_ssl: don't report full SAN list on success [puppet] - 10https://gerrit.wikimedia.org/r/319068 [13:15:00] Dereckson: He should've... [13:15:00] 18:55 yurik@tin: Synchronized wmf-config: labs syncup https://gerrit.wikimedia.org/r/#/c/318883 (duration: 00m 49s) [13:15:05] (03CR) 10BBlack: [C: 032 V: 032] check_ssl: don't report full SAN list on success [puppet] - 10https://gerrit.wikimedia.org/r/319068 (owner: 10BBlack) [13:15:10] (03PS2) 10BBlack: check_ssl: don't report full SAN list on success [puppet] - 10https://gerrit.wikimedia.org/r/319068 [13:15:12] (03CR) 10BBlack: [V: 032] check_ssl: don't report full SAN list on success [puppet] - 10https://gerrit.wikimedia.org/r/319068 (owner: 10BBlack) [13:16:08] Reedy: that was 2016-10-31 [13:16:26] Dereckson: yes, yesterday [13:16:42] Unless he merged more after? 
[13:17:04] 07:08:01 < grrrit-wm> (CR) Yurik: [C: 2] LABS: set isLocal=false for tabular data [mediawiki-config] - https://gerrit.wikimedia.org/r/319024 (owner: [13:17:07] Yurik) [13:17:09] 07:08:29 < grrrit-wm> (Merged) jenkins-bot: LABS: set isLocal=false for tabular data [mediawiki-config] - https://gerrit.wikimedia.org/r/319024 [13:17:12] (owner: Yurik) [13:17:43] :( [13:17:45] Reedy: yes this morning [13:18:30] if they just touch labs files, just deploy them [13:18:34] Leave a comment on the changeset [13:18:34] jan_drewniak: I'll sync first labs change for yurik, then I'll take care of yours [13:18:46] ok [13:23:33] (03CR) 10Dereckson: "Merged to master, but not deployed." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/319024 (owner: 10Yurik) [13:23:58] (03CR) 10Dereckson: "Merged to master, but not deployed." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/319036 (owner: 10Yurik) [13:27:05] and morebots is missing [13:27:15] !log Synchronized wmf-config/CommonSettings-labs.php: Labs: fix $wgJsonConfigInterwikiPrefix and set isLocal=false for tabular data ([[Gerrit:319024]] + [[Gerrit:319036]], no-op in prod) (duration: 00m 57s) [13:27:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:28:04] PROBLEM - puppet last run on cp4016 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:28:16] jan_drewniak: portal change live on mw1099 [13:28:55] Dereckson: looks good to me [13:29:14] ok, syncing [13:30:04] RECOVERY - puppet last run on cp4016 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures [13:31:20] 06Operations, 10ChangeProp, 10MediaWiki-JobQueue, 06Performance-Team, and 2 others: Asynchronous processing in production: one queue to rule them all - https://phabricator.wikimedia.org/T149408#2759693 (10Anomie) My philosophical standpoint: It's good when MediaWiki has a simple default way to do things in... 
[13:31:27] PHP Notice: Undefined variable: wmgUseGraphWithNamespace in /srv/mediawiki/wmf-config/CommonSettings.php on line 3084 [13:31:46] (probably from the purge script) [13:31:56] (03PS3) 10Faidon Liambotis: nagios: do both RSA/ECDSA checks in check_sslxNN [puppet] - 10https://gerrit.wikimedia.org/r/318949 [13:31:57] jan_drewniak: check if that worked, I got some errors during the sync, but script has completed [13:32:06] !log sync-portal: Synchronized portals/, purged URLs ([[Gerrit:319054]]) [13:32:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:32:19] (03CR) 10Faidon Liambotis: [C: 032] nagios: do both RSA/ECDSA checks in check_sslxNN [puppet] - 10https://gerrit.wikimedia.org/r/318949 (owner: 10Faidon Liambotis) [13:32:29] Dereckson: It looks like it worked :) [13:32:52] ah the error is the timeout for dologmsg [13:33:06] now we've the Undefined variable: wmgUseGraphWithNamespace in /srv/mediawiki/wmf-config/CommonSettings.php on line 3084 issue [13:33:19] (03PS4) 10Faidon Liambotis: nagios: do both RSA/ECDSA checks in check_sslxNN [puppet] - 10https://gerrit.wikimedia.org/r/318949 [13:33:39] probably introduced by 26214780513153593c0f1d6e5b89b6ec592f07ff Removed unused wmgUseGraphWithNamespace support [13:34:14] Reedy . is PS2 ok? If so we can deploy that. 
[13:34:35] arseny92: Reedy: wait a moment please [13:34:43] (03CR) 10Faidon Liambotis: "check_sslxNN is not used anymore and the other half of this patch was merged before — but add this for posterity and in case it ever becom" [puppet] - 10https://gerrit.wikimedia.org/r/318949 (owner: 10Faidon Liambotis) [13:35:37] wmgUseGraphWithNamespace has really been removed everywhere in wmf-config/ so I guess it's cache [13:37:32] Synchronized wmf-config/CommonSettings.php: SWAT: Removed unused wmgUseGraphWithNamespace support PART I (duration: 00m 47s) Synchronized wmf-config/InitialiseSettings.php: SWAT: Removed unused wmgUseGraphWithNamespace support PART II (duration: 00m 45s) [13:37:58] all looks good, probably some code still in cache [13:38:53] Dereckson , morebots is not here for SAL [13:38:57] arseny92: so? [13:39:34] jan_drewniak: I see sync-portals offers to run mwscript purgeList.php on deployment server directly [13:40:31] jan_drewniak: I wonder if that's a good idea: currently /srv/mediawiki @ tin isn't up to date [13:40:59] Dereckson , so the above sync message is not logged [13:41:09] arseny92: what? [13:41:22] Dereckson , morebots is not here for SAL [13:41:27] arseny92: it's a copy/paste from the SAL, not something I want to log [13:41:33] arseny92: I wouldn't just want to deploy that yet [13:41:35] (03PS1) 10Muehlenhoff: Set conntrack table max size via Hiera [puppet] - 10https://gerrit.wikimedia.org/r/319071 [13:41:36] uh [13:41:40] Also, needs DB tables creating on all private wikis [13:42:38] other than that is the change good? [13:42:55] Reedy: would you know how/when /srv/mediawiki/ *on Tin* is updated? [13:43:07] arseny92: I've not checked it :) [13:43:23] Dereckson: It should be the first step/part of sync-comasters [13:43:28] Before syncing to scap proxies [13:44:21] Reedy: it doesn't seem to work currently, at leastfor /srv/mediawiki/wmf-config/CommonSettings.php [13:44:47] does sync-common on tin do it? 
[13:45:13] 13:44:55 Copying to tin.eqiad.wmnet from deployment.eqiad.wmnet, 13:44:55 Started rsync common, let's see [13:46:24] [15:44] (PS1) Mobrovac: Revert "Deployment: use mira instead of tin" [services/change-propagation/deploy] - https://gerrit.wikimedia.org/r/319072 [13:46:24] [15:44] (CR) Mobrovac: [C: 2 V: 2] Revert "Deployment: use mira instead of tin" [services/change-propagation/deploy] - https://gerrit.wikimedia.org/r/319072 (owner: Mobrovac) [13:47:15] 06Operations, 10ops-eqiad: Add thermal paste to einsteinium - https://phabricator.wikimedia.org/T149685#2759705 (10Cmjohnson) [13:47:54] arseny92: stashbot does SAL logging now afaik [13:47:55] arseny92: we've two deployement servers, tin and mira ; generally, we use tin, but when tin must be updated, deployment activity switches to mira [13:49:08] arseny92: then, when update is done, that switches again to tin, the commit message you've copy/pasted is the mira → tin switch (as a revert of the tin → mira switch) [13:52:35] yes but you just said currently /srv/mediawiki @ tin isn't up to date so why the switch to use tin [13:52:39] RECOVERY - Ensure NFS exports are maintained for new instances with NFS on labstore1002 is OK: OK - nfs-exportd is active [13:53:26] "out of date" is subjective [13:53:30] !log scap pull @ tin to sync /srv/mediawiki locally [13:53:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:53:56] now up to date lol [13:53:59] 06Operations, 10OTRS: Intermittent 503 errors on OTRS ticket system when sending responses to tickets - https://phabricator.wikimedia.org/T148299#2759753 (10Josve05a) >>! In T148299#2759665, @akosiaris wrote: > @Josve05a, can't say I can reproduce. Is this still true ? It has declined in numbers and frequency,... [13:57:31] jan_drewniak: to avoid any further error, I'd prefer we do something like that for sync-portals: https://gerrit.wikimedia.org/r/319076 [14:01:31] Dereckson: ok... 
so that means deployers should purge the url's on a different server? [14:01:57] Dereckson: There's another way to do this [14:02:08] 06Operations, 10OTRS: Intermittent 503 errors on OTRS ticket system when sending responses to tickets - https://phabricator.wikimedia.org/T148299#2759757 (10akosiaris) @Josve05a OK. Would you be so kind as to provide me with a failed request logs next time it happens, so I can debug it ? Something like the stu... [14:03:30] Dereckson: There used to be a way to tell mwscript to use the staging dir [14:03:42] Though [14:03:42] cat /srv/mediawiki-staging/portals/urls-to-purg [14:03:49] I'm not sure why that's gonna be wrong? [14:08:15] !log created OATHAuth tables on all private wikis [14:08:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:10:14] logmsgbot: I don't see you logging [14:15:37] 06Operations, 10Gerrit, 10grrrit-wm, 13Patch-For-Review: Support restarting grrrit-wm automatically when we restart production gerrit - https://phabricator.wikimedia.org/T149609#2759780 (10Zppix) >>! In T149609#2758578, @Dzahn wrote: > And how would that be triggered from gerrit, when the whole point of ne... 
[14:16:37] (03PS1) 10Rush: labstore: keep nfs-kernel-server management in nfs-manage [puppet] - 10https://gerrit.wikimedia.org/r/319081 [14:16:44] (03PS2) 10Rush: labstore: keep nfs-kernel-server management in nfs-manage [puppet] - 10https://gerrit.wikimedia.org/r/319081 [14:16:59] PROBLEM - puppet last run on labstore2001 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago [14:18:24] (03PS9) 10Alexandros Kosiaris: icinga: Specify mode for nagios_host, nagios_service [puppet] - 10https://gerrit.wikimedia.org/r/317791 [14:18:26] (03PS10) 10Alexandros Kosiaris: icinga: Increase max_concurrent_checks [puppet] - 10https://gerrit.wikimedia.org/r/317763 [14:18:28] (03PS1) 10Alexandros Kosiaris: Use icinga.wikimedia.org instead of einsteinium.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/319082 [14:19:00] (03PS1) 10Ema: SystemTap Puppet module and role::systemtap::devserver [puppet] - 10https://gerrit.wikimedia.org/r/319083 [14:19:17] (03CR) 10Rush: [C: 032] labstore: keep nfs-kernel-server management in nfs-manage [puppet] - 10https://gerrit.wikimedia.org/r/319081 (owner: 10Rush) [14:20:29] (03PS2) 10Alexandros Kosiaris: Use icinga.wikimedia.org instead of einsteinium.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/319082 [14:20:33] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] Use icinga.wikimedia.org instead of einsteinium.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/319082 (owner: 10Alexandros Kosiaris) [14:20:50] (03PS11) 10Alexandros Kosiaris: icinga: Increase max_concurrent_checks [puppet] - 10https://gerrit.wikimedia.org/r/317763 [14:20:54] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] icinga: Increase max_concurrent_checks [puppet] - 10https://gerrit.wikimedia.org/r/317763 (owner: 10Alexandros Kosiaris) [14:21:03] let's see what happens now [14:21:32] 06Operations, 10Gerrit, 10grrrit-wm, 13Patch-For-Review: Support restarting grrrit-wm automatically when we restart production gerrit - 
https://phabricator.wikimedia.org/T149609#2759801 (10Zppix) [14:30:03] 06Operations, 10grrrit-wm: Whitelist for grrrit-wm - https://phabricator.wikimedia.org/T149689#2759805 (10Zppix) [14:30:06] PROBLEM - check_puppetrun on lutetium is CRITICAL: CRITICAL: Puppet has 6 failures [14:35:00] !log akosiaris@puppetmaster1001 conftool action : set/pooled=yes; selector: scb1003.eqiad.wmnet (tags: ['dc=eqiad', 'cluster=scb', 'service=cxserver']) [14:35:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:35:06] PROBLEM - check_puppetrun on lutetium is CRITICAL: CRITICAL: Puppet has 6 failures [14:35:56] (03PS2) 10Muehlenhoff: Create a separate sysctl configuration for setting connection tracking settings [puppet] - 10https://gerrit.wikimedia.org/r/319071 [14:37:08] (03CR) 10jenkins-bot: [V: 04-1] Create a separate sysctl configuration for setting connection tracking settings [puppet] - 10https://gerrit.wikimedia.org/r/319071 (owner: 10Muehlenhoff) [14:40:07] PROBLEM - check_puppetrun on lutetium is CRITICAL: CRITICAL: Puppet has 6 failures [14:40:16] (03PS3) 10Muehlenhoff: Create a separate sysctl configuration for setting connection tracking settings [puppet] - 10https://gerrit.wikimedia.org/r/319071 [14:41:57] 06Operations, 10ops-codfw, 10fundraising-tech-ops: payments2002 disk failure - https://phabricator.wikimedia.org/T149646#2759854 (10Papaul) p:05Triage>03Normal a:03Papaul [14:43:20] Reedy: defining MEDIAWIKI_STAGING_DIR=/srv/mediawiki-staging [14:43:40] looking at lutetium... [14:43:41] mwscript checks if this variable exists and is a valid directory, if so it uses its value [14:45:06] RECOVERY - check_puppetrun on lutetium is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures [14:45:58] 06Operations, 10ops-codfw: labstore2001 doesn't boot - https://phabricator.wikimedia.org/T149567#2759865 (10Papaul) 05Open>03Resolved I chat with Moritz on IRC he mentioned that it is okay to resolve this task. 
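(Editor's note on the 14:43 exchange above.) The behaviour Dereckson describes, where mwscript uses MEDIAWIKI_STAGING_DIR when it is set and points at a valid directory, can be sketched roughly as follows. This is only an illustration under assumptions: the function name `resolve_mediawiki_dir` and the fallback argument are hypothetical and are not taken from the real mwscript source; only the variable name and the two paths come from the log.

```shell
# Hypothetical sketch, NOT the actual mwscript code.
# Prints the directory a wrapper like mwscript might run from.
# $1 is an assumed fallback; /srv/mediawiki is the path mentioned in the log.
resolve_mediawiki_dir() {
    default_dir="${1:-/srv/mediawiki}"
    # Honour the override only if the variable is set AND names a real directory.
    if [ -n "${MEDIAWIKI_STAGING_DIR:-}" ] && [ -d "$MEDIAWIKI_STAGING_DIR" ]; then
        printf '%s\n' "$MEDIAWIKI_STAGING_DIR"
    else
        printf '%s\n' "$default_dir"
    fi
}
```

Under that assumption, calling `resolve_mediawiki_dir` with MEDIAWIKI_STAGING_DIR=/srv/mediawiki-staging set would print the staging path on a host where that directory exists, and would fall back to /srv/mediawiki otherwise.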
[14:50:57] !log Local-hacking some JavaScript changes on mw1099 to debug T146510 [14:51:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:51:03] T146510: Investigate async cache-eval load-time regression (2016-09-08) - https://phabricator.wikimedia.org/T146510 [14:51:29] marostegui: GM is there anything you need me to do on https://phabricator.wikimedia.org/T149099? [14:56:29] 06Operations, 10ops-codfw, 10fundraising-tech-ops: payments2002 disk failure - https://phabricator.wikimedia.org/T149646#2759882 (10Papaul) @Jgreen DO we have any other logs for this? HP will request that. Thanks. [15:01:45] !log upgrading/rolling restart of remaining wtp nodes in eqiad to nodejs 4.6 [15:01:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:05:23] 06Operations, 10ops-codfw, 10fundraising-tech-ops: payments2002 disk failure - https://phabricator.wikimedia.org/T149646#2759925 (10Jgreen) Hopefully this is all they need: root@payments2002:~# hpssacli ctrl slot=1 pd 2I:1:2 show detail Smart Array P222 in Slot 1 array A physicaldrive 2I:1:2... 
[15:06:15] (03PS1) 10Andrew Bogott: nova_fixed_multi: Change a bunch of debug messages to warnings [puppet] - 10https://gerrit.wikimedia.org/r/319090 (https://phabricator.wikimedia.org/T115194) [15:07:16] (03CR) 10jenkins-bot: [V: 04-1] nova_fixed_multi: Change a bunch of debug messages to warnings [puppet] - 10https://gerrit.wikimedia.org/r/319090 (https://phabricator.wikimedia.org/T115194) (owner: 10Andrew Bogott) [15:09:09] (03PS2) 10Andrew Bogott: nova_fixed_multi: Change a bunch of debug messages to warnings [puppet] - 10https://gerrit.wikimedia.org/r/319090 (https://phabricator.wikimedia.org/T115194) [15:10:09] (03CR) 10jenkins-bot: [V: 04-1] nova_fixed_multi: Change a bunch of debug messages to warnings [puppet] - 10https://gerrit.wikimedia.org/r/319090 (https://phabricator.wikimedia.org/T115194) (owner: 10Andrew Bogott) [15:11:13] (03PS3) 10Andrew Bogott: nova_fixed_multi: Change a bunch of debug messages to warnings [puppet] - 10https://gerrit.wikimedia.org/r/319090 (https://phabricator.wikimedia.org/T115194) [15:13:23] (03CR) 10Andrew Bogott: [C: 032] nova_fixed_multi: Change a bunch of debug messages to warnings [puppet] - 10https://gerrit.wikimedia.org/r/319090 (https://phabricator.wikimedia.org/T115194) (owner: 10Andrew Bogott) [15:13:54] 06Operations, 10ops-codfw: labstore2001 doesn't boot - https://phabricator.wikimedia.org/T149567#2759936 (10Papaul) 05Resolved>03Open a:05Papaul>03None Re-opening this task since disk in slot 5 on labstore2001 is bad. [15:19:45] 06Operations, 10ops-codfw: Predictive disk failure on db2047 - https://phabricator.wikimedia.org/T149670#2759962 (10Papaul) @jcrespo can you please give me a detail log on this like you did for T149377? Thanks. 
[15:20:13] (03PS5) 10Mdann52: Localisation of Babel categories on nap.wikipedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/263342 (https://phabricator.wikimedia.org/T123188) [15:20:24] (03CR) 10Nikerabbit: [C: 031] Localisation of Babel categories on nap.wikipedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/263342 (https://phabricator.wikimedia.org/T123188) (owner: 10Mdann52) [15:23:17] 06Operations, 10ops-codfw: RAID degraded on ms-be2011 - https://phabricator.wikimedia.org/T149234#2759988 (10Papaul) I have a task open to order disk for the system . T149693 [15:24:25] (03CR) 10Dereckson: [C: 04-1] "Configuration is correct, but this lacks local community support." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/263342 (https://phabricator.wikimedia.org/T123188) (owner: 10Mdann52) [15:29:14] 06Operations, 10ops-codfw: RAID degraded on ms-be2011 - https://phabricator.wikimedia.org/T149234#2760040 (10Papaul) p:05Triage>03Normal [15:29:32] 06Operations, 10ops-codfw: Predictive disk failure on db2047 - https://phabricator.wikimedia.org/T149670#2760042 (10Papaul) p:05Triage>03Normal [15:29:59] 06Operations, 10ops-codfw, 10DBA: Degraded RAID on db2052 - https://phabricator.wikimedia.org/T149377#2760045 (10Papaul) p:05Triage>03Normal [15:35:45] 06Operations, 10Analytics, 10ChangeProp, 10Citoid, and 9 others: Node 6 upgrade planning - https://phabricator.wikimedia.org/T149331#2760054 (10GWicke) [15:41:43] !log Installed nmap on iron [15:41:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:44:37] (03PS1) 10Arseny1992: Enable WikiLove extension on Bengali Wikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/319095 (https://phabricator.wikimedia.org/T149683) [15:47:01] (03PS1) 10Muehlenhoff: Update to 4.4.29 (and the regression release 4.4.30) [debs/linux44] - 10https://gerrit.wikimedia.org/r/319096 [15:47:44] Deployers, does scap etc automatically run update.php to create extension db tables etc? 
Or how's that usually done when doing extensions enablements as per above? [15:48:20] No [15:48:31] Production does not use update.php at all [15:48:35] It's done as a semi manual process [15:49:26] Reedy then how? I like was reading the extension manpage and seeing that it needs tables. [15:49:27] (03PS1) 10Faidon Liambotis: Add asw2-d-eqiad [dns] - 10https://gerrit.wikimedia.org/r/319097 [15:49:37] !log Created wikilove tables on bnwikisource for T149683 [15:49:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:49:43] T149683: Enable WikiLove extension on Bengali Wikisource - https://phabricator.wikimedia.org/T149683 [15:49:47] (03CR) 10Faidon Liambotis: [C: 032] Add asw2-d-eqiad [dns] - 10https://gerrit.wikimedia.org/r/319097 (owner: 10Faidon Liambotis) [15:50:02] oh ;) [15:50:24] 06Operations, 10ops-codfw, 10fundraising-tech-ops: payments2002 disk failure - https://phabricator.wikimedia.org/T149646#2760132 (10Papaul) @Jgreen Yes thank you. [15:50:42] arseny92: for your reference, we generally use the createExtensionTables.php from the WikimediaMaintenance extension [15:51:58] (03PS1) 10Faidon Liambotis: netops (etc.): add asw2-d-eqiad [puppet] - 10https://gerrit.wikimedia.org/r/319098 [15:52:16] I think the only reason we wouldn't is if the extension hasn't yet been added to that script [15:52:22] Mmm [15:52:28] echo used to be a problem with it's x1 tables but that was fixed [15:52:37] And if it's gonna be done as an adhoc thing, it's easier to add [15:52:44] Hence me adding OATHAuth for my laziness [15:53:05] 06Operations, 10ops-eqiad, 10media-storage: diagnose failed disks on ms-be1027 - https://phabricator.wikimedia.org/T140374#2760141 (10Cmjohnson) @fgiunchedi spent the morning on the phone with HP....good news bad news. they're sending me a new system board, backplane and 2 new ssds...bad news, system board... 
[15:53:46] Good to know ;) [15:55:38] 06Operations, 10Gerrit, 06Release-Engineering-Team, 13Patch-For-Review: Investigate why gerrit slowed down on 17/10/2016 / 18/10/2016 / 21/10/2016 - https://phabricator.wikimedia.org/T148478#2760146 (10ArielGlenn) Preliminary findings from the logs for about 1 week: - There were 238 full GCs or an averag... [15:58:07] !log checking all serial cables to row D in eqiad. [15:58:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:59:41] !log graphite-carbon restart after merging https://gerrit.wikimedia.org/r/#/c/316810/ [15:59:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:59:47] (03PS4) 10Filippo Giunchedi: graphite: change Cassandra '.count' metrics aggregation [puppet] - 10https://gerrit.wikimedia.org/r/316810 (https://phabricator.wikimedia.org/T121789) [16:00:04] godog, moritzm, and _joe_: Respected human, time to deploy Puppet SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161101T1600). Please do the needful. [16:00:21] 06Operations, 10ops-eqiad, 10media-storage: diagnose failed disks on ms-be1027 - https://phabricator.wikimedia.org/T140374#2760161 (10RobH) We can escalate this issue to Dasher, they may be able to expedite the system board replacement. Do you have a case # I can pass along to Dasher? 
[16:00:47] Scheduled ;) [16:00:49] nothing lined up for puppet swat [16:01:38] puppet swat done before even started [16:01:57] unless someone just adds in last moment ;) [16:02:09] PROBLEM - thumbor@8823 service on thumbor1001 is CRITICAL: CRITICAL - Expecting active but unit thumbor@8823 is inactive [16:02:49] (03CR) 10Filippo Giunchedi: [C: 032] graphite: change Cassandra '.count' metrics aggregation [puppet] - 10https://gerrit.wikimedia.org/r/316810 (https://phabricator.wikimedia.org/T121789) (owner: 10Filippo Giunchedi) [16:07:09] RECOVERY - thumbor@8823 service on thumbor1001 is OK: OK - thumbor@8823 is active [16:09:32] PROBLEM - carbon-cache@c service on graphite1003 is CRITICAL: CRITICAL - Expecting active but unit carbon-cache@c is failed [16:09:49] graphite is me, fix incoming [16:10:00] PROBLEM - puppet last run on cp3020 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:10:49] PROBLEM - puppet last run on graphite1003 is CRITICAL: CRITICAL: Puppet has 5 failures. Last run 2 minutes ago with 5 failures. 
Failed resources (up to 3 shown): Service[carbon-cache@b],Service[carbon-cache@g],Service[carbon-cache@e],Service[carbon-cache@f] [16:11:01] (03PS1) 10Filippo Giunchedi: graphite: s/avg/average/ for aggregationMethod [puppet] - 10https://gerrit.wikimedia.org/r/319100 [16:11:56] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] graphite: s/avg/average/ for aggregationMethod [puppet] - 10https://gerrit.wikimedia.org/r/319100 (owner: 10Filippo Giunchedi) [16:13:29] RECOVERY - carbon-cache@c service on graphite1003 is OK: OK - carbon-cache@c is active [16:20:09] RECOVERY - check_raid on payments2002 is OK: OK: HPSA [P222/slot1: OK, log_1: 465.7GB,RAID1 OK] [16:22:08] ACKNOWLEDGEMENT - HP RAID on db2052 is CRITICAL: CRITICAL: Slot 0: OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:9, 1I:1:10, 1I:1:11, 1I:1:12, Controller, Battery/Capacitor - Failed: 1I:1:4 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T149703 [16:22:12] 06Operations, 10ops-codfw: Degraded RAID on db2052 - https://phabricator.wikimedia.org/T149703#2760219 (10ops-monitoring-bot) [16:24:47] RECOVERY - puppet last run on graphite1003 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures [16:25:07] RECOVERY - puppet last run on labstore2001 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures [16:27:09] 06Operations, 10Analytics, 10ChangeProp, 10Citoid, and 11 others: Node 6 upgrade planning - https://phabricator.wikimedia.org/T149331#2760265 (10KartikMistry) [16:34:31] (03CR) 10Filippo Giunchedi: [C: 031] "@ori fair enough, checking freshness could come handy for some metrics we're collecting now" [puppet] - 10https://gerrit.wikimedia.org/r/251675 (owner: 10Ori.livneh) [16:36:20] 06Operations, 10Cassandra, 06Services, 13Patch-For-Review: Change graphite aggregation function for cassandra 'count' metrics - https://phabricator.wikimedia.org/T121789#2760314 (10fgiunchedi) I've updated xenon and cerium aggregation methods for 
all `.count` metrics, @Eevans let's compare those in a week... [16:39:17] RECOVERY - puppet last run on cp3020 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures [16:42:21] 06Operations, 10ops-eqiad, 10media-storage: diagnose failed disks on ms-be1027 - https://phabricator.wikimedia.org/T140374#2760340 (10Cmjohnson) Yes, Support Case Number: 5314079417-531 [16:52:57] (03PS5) 10Muehlenhoff: Also provide imagemagick wrapper in openstack::nova::manager [puppet] - 10https://gerrit.wikimedia.org/r/316545 (https://phabricator.wikimedia.org/T145811) [16:55:45] (03CR) 10Muehlenhoff: [C: 032] Also provide imagemagick wrapper in openstack::nova::manager [puppet] - 10https://gerrit.wikimedia.org/r/316545 (https://phabricator.wikimedia.org/T145811) (owner: 10Muehlenhoff) [17:00:04] yurik, gwicke, cscott, arlolra, subbu, halfak, and Amir1: Respected human, time to deploy Services – Graphoid / Parsoid / OCG / Citoid / ORES (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161101T1700). Please do the needful. [17:00:14] no parsoid deploy today [17:03:36] !log starting branch cut for 1.29.0-wmf.1 [17:03:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:04:12] that mean 1.28 is released? [17:04:37] arseny92 no [17:04:50] that wont be released until sometime this month [17:05:13] why the branching to next version then [17:05:28] Because 1.28 goes through rc now. [17:05:51] rc? [17:06:29] release candidate [17:06:41] https://lists.wikimedia.org/pipermail/wikitech-l/2016-October/086859.html [17:09:59] ah right [17:13:03] PROBLEM - puppet last run on mw1161 is CRITICAL: CRITICAL: Catalog fetch fail.
Either compilation failed or puppetmaster has issues [17:14:32] (03CR) 10Mobrovac: [WIP] evenstreams puppetization (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/317981 (https://phabricator.wikimedia.org/T148779) (owner: 10Ottomata) [17:18:41] (03PS1) 10Madhuvishy: nfs backup: Add mount definitions for backup volumes [puppet] - 10https://gerrit.wikimedia.org/r/319107 [17:21:35] 06Operations, 10Parsoid, 13Patch-For-Review, 15User-mobrovac: Deploy failed on wtp2017.codfw.wmnet - https://phabricator.wikimedia.org/T149115#2760562 (10Arlolra) @mobrovac Can I dirty my local tree with those changes (`command: depool-parsoid`) when deploying tomorrow? Getting through a deploy cleanly ha... [17:22:47] PROBLEM - puppet last run on etcd1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:26:45] (03CR) 10Madhuvishy: [C: 032] nfs backup: Add mount definitions for backup volumes [puppet] - 10https://gerrit.wikimedia.org/r/319107 (owner: 10Madhuvishy) [17:26:53] 06Operations, 06Labs, 13Patch-For-Review: Phase out the 'puppet' module with fire, make self hosted puppetmasters use the puppetmaster module - https://phabricator.wikimedia.org/T120159#2760585 (10yuvipanda) The issues with role::puppetmaster::standalone not being able to be its own client are fixed now! htt... [17:27:27] I'm trying to do some updates on visualeditor-test.wmflabs.org [17:27:33] but all my vagrant commands fail with: [17:27:37] The provider 'lxc' could not be found, but was requested to [17:27:37] back the machine 'default'. Please use a provider that exists. [17:28:33] 06Operations, 10Parsoid, 13Patch-For-Review, 15User-mobrovac: Deploy failed on wtp2017.codfw.wmnet - https://phabricator.wikimedia.org/T149115#2760600 (10mobrovac) Yup, @Arlolra, that should be just fine, just don't checkout or cherry-pick that commit, as it will make Scap go crazy. 
[17:29:25] (03PS1) 10Madhuvishy: nfs backup: Fix requires paths on mount definitions [puppet] - 10https://gerrit.wikimedia.org/r/319108 [17:29:45] edsanders: check to see if you are getting the shell alias that maps 'vagrant' to `/usr/local/bin/mwvagrant` [17:30:07] PROBLEM - puppet last run on labstore2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:30:35] on it ^ [17:30:38] no /usr/bin/vagrant [17:30:43] (03CR) 10Madhuvishy: [C: 032] nfs backup: Fix requires paths on mount definitions [puppet] - 10https://gerrit.wikimedia.org/r/319108 (owner: 10Madhuvishy) [17:31:41] I guess I can just type mwvagrant... [17:31:45] edsanders: /etc/profile.d/alias-vagrant.sh should set a shell alias that you need. You can use /usr/local/bin/mwvagrant directly. [17:31:54] thanks! [17:32:07] RECOVERY - puppet last run on labstore2001 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures [17:34:27] !log Rebooting host labstore2001 [17:34:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:38:52] papaul: I rebooted labstore2001 to test a few minutes back, and it hasn't come up [17:39:31] (03PS1) 10Dzahn: admin: add datacenter-ops on iron [puppet] - 10https://gerrit.wikimedia.org/r/319110 (https://phabricator.wikimedia.org/T147074) [17:39:35] not sure if the boot partition is wrong again, or it requires some other manual intervention to come up [17:41:07] RECOVERY - puppet last run on mw1161 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures [17:41:53] I can get to the management console though [17:46:02] (03PS2) 10Dzahn: admin: add datacenter-ops on iron [puppet] - 10https://gerrit.wikimedia.org/r/319110 (https://phabricator.wikimedia.org/T147074) [17:49:00] madhuvishy: on lunch [17:49:07] PROBLEM - puppet last run on labstore2003 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 2 minutes ago with 3 failures. 
Failed resources (up to 3 shown): File[/srv/eqiad/others],File[/srv/eqiad/maps],File[/srv/eqiad/tools] [17:49:47] RECOVERY - puppet last run on etcd1001 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [17:53:54] I was able to get in to labstore2001 through the management console using root password, I checked the interface and mounts, seemed okay, I typed exit, but I seem to be repeatedly stuck with this series of events followed by this prompt [17:53:58] https://www.irccloud.com/pastebin/3BJnkika/ [17:54:13] if I type ctrl-D, the same thing happens over again [17:54:47] cmjohnson1: would you happen to know anything about this prompt? I am not sure how to leave now [17:55:42] PROBLEM - puppet last run on labstore2004 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 2 minutes ago with 3 failures. Failed resources (up to 3 shown): File[/srv/eqiad/others],File[/srv/eqiad/maps],File[/srv/eqiad/tools] [17:57:07] madhuvishy....exit should work [17:59:46] !log ban releases.wikimedia.org/debian from cache_misc to fetch Packages/Release again [17:59:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:00:04] addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Dear anthropoid, the time has come. Please deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161101T1800). [18:00:04] arseny92: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [18:00:09] !log rebooting labvirt1002 [18:00:12] . 
[18:00:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:00:24] (03PS1) 10Madhuvishy: Remove nfs backup role from labstore200[3-4] in site.pp [puppet] - 10https://gerrit.wikimedia.org/r/319112 [18:01:00] cmjohnson1: hmm it is now on loop with this - [* ] A start job is running for Activation of LVM2 logica...25s / no limit). Will try exit when/if it finishes [18:01:27] arseny92: are you sure this can be deployed? [18:01:38] ah yes, missed the " Community consensus link is here." [18:01:53] That has indeed been discussed on wiki, so yes, we can. [18:02:22] The local community held that for a week or so before filing that task. No objections were posted. [18:02:28] * Dereckson nods [18:03:35] So I'll take care of this SWAT. [18:04:00] (03CR) 10Dereckson: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/319095 (https://phabricator.wikimedia.org/T149683) (owner: 10Arseny1992) [18:04:04] (03PS1) 10Jcrespo: mariadb: increase api resources for enwiki -high api load [mediawiki-config] - 10https://gerrit.wikimedia.org/r/319113 [18:04:20] Tables have already been created. [18:04:31] (03Merged) 10jenkins-bot: Enable WikiLove extension on Bengali Wikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/319095 (https://phabricator.wikimedia.org/T149683) (owner: 10Arseny1992) [18:04:33] Yes, by Reedy [18:05:13] arseny92: live on mw1099, please test.
JS resources are cached for 5 minutes, so if it doesn't appear to immediately work, add ?debug=true to the URL [18:05:50] (03CR) 10Rush: [C: 031] Remove nfs backup role from labstore200[3-4] in site.pp [puppet] - 10https://gerrit.wikimedia.org/r/319112 (owner: 10Madhuvishy) [18:06:08] (03CR) 10Madhuvishy: [C: 032] Remove nfs backup role from labstore200[3-4] in site.pp [puppet] - 10https://gerrit.wikimedia.org/r/319112 (owner: 10Madhuvishy) [18:06:21] Dereckson, I see some bad patterns on enwiki, I have prepared https://gerrit.wikimedia.org/r/319113 [18:07:34] jynus: you want I sync it after arseny92 change, or I ping you when I'm done with SWAT? [18:07:57] I will for now watch evolution [18:08:08] will take a decision in some minutes, continue for now as normal [18:08:12] k [18:10:22] PROBLEM - check_raid on payments2002 is CRITICAL: CRITICAL: HPSA [P222/slot1: OK, log_1: 465.7GB,RAID1 Interim Recovery Mode, phy_2I:1:2: Failed] [18:12:50] arseny92: ping ? [18:14:02] https://bn.wikisource.org/w/index.php?title=%E0%A6%AC%E0%A7%8D%E0%A6%AF%E0%A6%AC%E0%A6%B9%E0%A6%BE%E0%A6%B0%E0%A6%95%E0%A6%BE%E0%A6%B0%E0%A7%80_%E0%A6%86%E0%A6%B2%E0%A6%BE%E0%A6%AA%3ABodhisattwa&type=revision&diff=673523&oldid=670069 [18:14:17] So I guess it works. [18:14:42] Logs look good. Syncing. [18:15:22] PROBLEM - check_raid on payments2002 is CRITICAL: CRITICAL: HPSA [P222/slot1: OK, log_1: 465.7GB,RAID1 Interim Recovery Mode, phy_2I:1:2: Failed] [18:15:34] Yes. Enabled in prefs, then clicked the heart icon and did the message in the wizard. 
Resulting in the above edit tagged wikilove [18:15:44] So works [18:16:08] !log dereckson@tin Synchronized wmf-config/InitialiseSettings.php: Enable wikilove on bn.wikisource (duration: 01m 44s) [18:16:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:16:51] ACKNOWLEDGEMENT - check_raid on payments2002 is CRITICAL: CRITICAL: HPSA [P222/slot1: OK, log_1: 465.7GB,RAID1 Interim Recovery Mode, phy_2I:1:2: Failed] Jeff_Green this is a known known [18:17:22] RECOVERY - puppet last run on labstore2003 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [18:18:56] (03PS2) 10Jcrespo: mariadb: increase api resources for enwiki -high api load [mediawiki-config] - 10https://gerrit.wikimedia.org/r/319113 [18:19:21] uh you forgot the taskid in the log message for stashbot task reply [18:20:02] madhuvishy: hey are you okay on labstore2001 now? [18:20:19] (03PS3) 10Jcrespo: mariadb: increase api resources for enwiki -high api load [mediawiki-config] - 10https://gerrit.wikimedia.org/r/319113 [18:20:22] arseny92: update the task manually [18:21:06] papaul: not sure so far, i did an exit after logging in through management console - it's been doing this [ ***] A start job is running for Activation of LVM2 logica...53s / no limit) for the last 20 minutes [18:21:23] arseny92: you can mark it resolved with a comment it has been deployed. [18:21:31] Just did [18:22:03] Dereckson, I would like to do a commit now and another after you are finished [18:22:41] we finished ;) [18:22:43] RECOVERY - puppet last run on labstore2004 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures [18:23:01] madhuvishy: is the network interface eth0 up or down?
[18:23:08] papaul: it was up [18:23:14] jynus: I'm done [18:23:23] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 637 600 - REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 3077905 keys, up 1 days 10 hours - replication_delay is 637 [18:23:26] ok, going to deploy one change now [18:23:27] papaul: i checked, and then exit, and now stuck on this loop [18:24:05] madhuvishy: i think it was the same loop i was having yesteray [18:24:11] aah [18:24:12] (03PS4) 10Jcrespo: mariadb: increase api resources for enwiki -high api load [mediawiki-config] - 10https://gerrit.wikimedia.org/r/319113 [18:24:21] madhuvishy: can you logout for me to check thanks? [18:24:25] papaul: did it finish at some point? [18:24:31] papaul: i'm not sure how to [18:25:11] madhuvishy: not a problem i can end your session [18:25:19] papaul: ah alright [18:25:37] (03CR) 10Jcrespo: [C: 032] mariadb: increase api resources for enwiki -high api load [mediawiki-config] - 10https://gerrit.wikimedia.org/r/319113 (owner: 10Jcrespo) [18:26:00] (03CR) 10Dzahn: [C: 032] "as discussed on IRC" [puppet] - 10https://gerrit.wikimedia.org/r/319110 (https://phabricator.wikimedia.org/T147074) (owner: 10Dzahn) [18:26:08] (03PS3) 10Dzahn: admin: add datacenter-ops on iron [puppet] - 10https://gerrit.wikimedia.org/r/319110 (https://phabricator.wikimedia.org/T147074) [18:26:30] madhuvishy: do control + \ and type exit and see [18:26:46] papaul: cool. left [18:26:50] madhuvishy: ok [18:27:24] !log jynus@tin Synchronized wmf-config/db-eqiad.php: increase api resources for enwiki -high api load (duration: 00m 48s) [18:27:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:28:08] it will take some time until I can deploy again, in case someone has to do something else [18:29:26] jynus: I think we're all done up until the train runs in 30 mins. That will take a bit of time. 
[18:29:34] thanks [18:30:17] I do not want to be alarming, but we are almost doubling our regular enwiki traffic: https://grafana.wikimedia.org/dashboard/db/mysql-aggregated?panelId=1&fullscreen&var-dc=eqiad%20prometheus%2Fops&var-group=core&var-shard=s1&var-role=All&from=1477416604999&to=1478025005000 [18:31:20] madhuvishy: lvm2 is stopping the boot process someone that have OS access level needs to check that [18:31:54] chasemp: ^ [18:32:21] madhuvishy: i am out of the console [18:33:07] how did it manage to come up previously? [18:34:20] chasemp: Moritz did some magic [18:34:33] (03PS1) 10Jcrespo: mariadb: pool new enwiki api servers to 100% after initial warm-up [mediawiki-config] - 10https://gerrit.wikimedia.org/r/319119 [18:35:03] chasemp: so it came up when Moritz worked on it than after madhuvishy restart it same problem [18:35:49] chasemp: I was able to get it once, and then this showed up when I typed exit [18:36:09] get in* [18:37:18] !log rebooting labvirt1003 [18:37:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:38:42] chasemp, madhuvishy, papaul: this morning when I had a look at labstore I could only log in via mgmt and the root password [18:38:52] it only had a loopback device [18:39:03] PROBLEM - Host tools.wmflabs.org is DOWN: CRITICAL - Time to live exceeded (tools.wmflabs.org) [18:39:14] when I manually ran "ifup eth0" network connectivity was fine again [18:39:31] but I needed to fixup a few services which had earlier failed to start (like lldpd) [18:39:53] RECOVERY - Host tools.wmflabs.org is UP: PING OK - Packet loss = 0%, RTA = 36.38 ms [18:40:26] jynus: closed day in some countries [18:40:30] moritzm: so it was at a prompt though and you never rebooted again? 
[18:41:05] no, I didn't reboot [18:41:06] Dereckson, instant 200% traffic is not normal [18:41:28] (03PS3) 10Filippo Giunchedi: centralserver: add mtail for kernel messages [puppet] - 10https://gerrit.wikimedia.org/r/316544 (https://phabricator.wikimedia.org/T147923) [18:41:29] just enabled network and fixed up services so that a normal login via SSH become possible [18:41:30] (03PS5) 10Filippo Giunchedi: Introduce mtail module [puppet] - 10https://gerrit.wikimedia.org/r/316543 (https://phabricator.wikimedia.org/T147923) [18:41:32] moritzm: k, that makes me think the reboots at these LVM sizes take a long time and it had a bit before you were looking [18:41:34] and not without increas on browser traffic [18:41:47] right, I mounted the /dev/backup lvms and rebooted [18:41:59] either bot or garget related, based on past experiences [18:42:09] *gadget [18:42:43] PROBLEM - MegaRAID on db1073 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) [18:42:45] chasemp: you mean it was still in bootup? 
I doubt that, there must have been at least 12 hours between Papaul leaving it and myself looking into it [18:42:45] ACKNOWLEDGEMENT - MegaRAID on db1073 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T149728 [18:42:56] (03PS1) 10Filippo Giunchedi: prometheus: swap varnish_exporter ports fe/be [puppet] - 10https://gerrit.wikimedia.org/r/319120 [18:42:59] wow, that is bad timing [18:44:13] PROBLEM - Host www.toolserver.org is DOWN: CRITICAL - Host Unreachable (www.toolserver.org) [18:45:21] (03PS2) 10Filippo Giunchedi: prometheus: swap varnish_exporter ports fe/be [puppet] - 10https://gerrit.wikimedia.org/r/319120 [18:45:24] moritzm: for you nope, not after all that time (agreed) but for madhu after 20m or so sure [18:45:53] (03CR) 10Jcrespo: [C: 032] mariadb: pool new enwiki api servers to 100% after initial warm-up [mediawiki-config] - 10https://gerrit.wikimedia.org/r/319119 (owner: 10Jcrespo) [18:46:05] I will still go ahead as planned [18:46:21] chasemp: ah, now I get it [18:46:42] (03CR) 10Filippo Giunchedi: [C: 032] prometheus: swap varnish_exporter ports fe/be [puppet] - 10https://gerrit.wikimedia.org/r/319120 (owner: 10Filippo Giunchedi) [18:47:00] because I believe that was probably the cause of recent lag on that server [18:48:28] !log jynus@tin Synchronized wmf-config/db-eqiad.php: pool new enwiki api servers to 100% after initial warm-up (duration: 00m 49s) [18:48:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:50:23] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 3072538 keys, up 1 days 10 hours - replication_delay is 0 [18:51:14] (03CR) 10Filippo Giunchedi: "latest PSes add graphite_prefix option to command line arguments for which I've sent the patch upstream at https://github.com/google/mtail" [puppet] - 10https://gerrit.wikimedia.org/r/316543 
(https://phabricator.wikimedia.org/T147923) (owner: 10Filippo Giunchedi) [18:51:21] (03PS4) 10Filippo Giunchedi: centralserver: add mtail for kernel messages [puppet] - 10https://gerrit.wikimedia.org/r/316544 (https://phabricator.wikimedia.org/T147923) [18:51:22] (03PS6) 10Filippo Giunchedi: Introduce mtail module [puppet] - 10https://gerrit.wikimedia.org/r/316543 (https://phabricator.wikimedia.org/T147923) [18:55:10] (03CR) 10Filippo Giunchedi: [C: 032] centralserver: add mtail for kernel messages [puppet] - 10https://gerrit.wikimedia.org/r/316544 (https://phabricator.wikimedia.org/T147923) (owner: 10Filippo Giunchedi) [18:55:24] and when I fix it, the issue disappears :-/ [18:56:23] (03PS5) 10Filippo Giunchedi: centralserver: add mtail for kernel messages [puppet] - 10https://gerrit.wikimedia.org/r/316544 (https://phabricator.wikimedia.org/T147923) [18:56:25] (03PS7) 10Filippo Giunchedi: Introduce mtail module [puppet] - 10https://gerrit.wikimedia.org/r/316543 (https://phabricator.wikimedia.org/T147923) [18:56:48] (03CR) 10Ori.livneh: "The "enabled" parameter is documented and validated but not used, AFAICT" [puppet] - 10https://gerrit.wikimedia.org/r/316543 (https://phabricator.wikimedia.org/T147923) (owner: 10Filippo Giunchedi) [18:58:23] (03PS2) 10Dzahn: clean up indentation, formatting and comments [puppet] - 10https://gerrit.wikimedia.org/r/318893 (owner: 10ArielGlenn) [18:58:46] (03PS3) 10ArielGlenn: mgmt/changepw: clean up indentation, formatting and comments [puppet] - 10https://gerrit.wikimedia.org/r/318893 [18:59:07] (03CR) 10Dzahn: [C: 032] mgmt/changepw: clean up indentation, formatting and comments [puppet] - 10https://gerrit.wikimedia.org/r/318893 (owner: 10ArielGlenn) [19:00:04] twentyafterfour: Dear anthropoid, the time has come. Please deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161101T1900). [19:00:19] ori: used in default.erb as @enabled [19:00:33] mutante: ping? 
If so, could you restore https://gerrit.wikimedia.org/r/#/c/305120/ and https://gerrit.wikimedia.org/r/#/c/305095/ ? [19:01:18] (03Restored) 10Dzahn: add projectcom.wikimedia.org for new private wiki [dns] - 10https://gerrit.wikimedia.org/r/305120 (https://phabricator.wikimedia.org/T143138) (owner: 10Dzahn) [19:01:30] RECOVERY - Host www.toolserver.org is UP: PING OK - Packet loss = 0%, RTA = 36.72 ms [19:01:33] (03PS8) 10Filippo Giunchedi: Introduce mtail module [puppet] - 10https://gerrit.wikimedia.org/r/316543 (https://phabricator.wikimedia.org/T147923) [19:01:36] godog: woops, correct [19:01:38] sorry! [19:01:53] ori: no worries :) thanks for taking a look! [19:01:54] (03Restored) 10Dzahn: realm: add 'projectcom' to private wiki list [puppet] - 10https://gerrit.wikimedia.org/r/305095 (https://phabricator.wikimedia.org/T143138) (owner: 10Dzahn) [19:01:58] (03CR) 10Ori.livneh: [C: 031] Introduce mtail module [puppet] - 10https://gerrit.wikimedia.org/r/316543 (https://phabricator.wikimedia.org/T147923) (owner: 10Filippo Giunchedi) [19:02:01] Dereckson: done [19:02:26] jouncebot: next [19:02:26] In 3 hour(s) and 57 minute(s): Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161101T2300) [19:02:29] jouncebot: now [19:02:29] For the next 1 hour(s) and 57 minute(s): MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161101T1900) [19:02:34] mutante: thanks, that will be the wiki URL finally [19:02:58] Dereckson: ok, great. 
restore is easy and i prefer that over a patch sitting in queue for such a long time [19:03:12] (03CR) 10Filippo Giunchedi: [C: 032] Introduce mtail module [puppet] - 10https://gerrit.wikimedia.org/r/316543 (https://phabricator.wikimedia.org/T147923) (owner: 10Filippo Giunchedi) [19:03:28] (03PS6) 10Filippo Giunchedi: centralserver: add mtail for kernel messages [puppet] - 10https://gerrit.wikimedia.org/r/316544 (https://phabricator.wikimedia.org/T147923) [19:03:33] (03CR) 10Filippo Giunchedi: [V: 032] centralserver: add mtail for kernel messages [puppet] - 10https://gerrit.wikimedia.org/r/316544 (https://phabricator.wikimedia.org/T147923) (owner: 10Filippo Giunchedi) [19:03:35] jynus: would I be stepping on your toes if I started the train? Will block other deploys for probably an hour-ish [19:05:14] no [19:05:30] I do not plan to do more changes [19:05:37] ok :) [19:05:42] starting the train now then [19:05:52] this configuration is ok now and better if there are issues again [19:07:02] there are cirrus count issues, but those are known issues [19:08:20] (03PS3) 10Dzahn: realm: add 'projectcom' to private wiki list [puppet] - 10https://gerrit.wikimedia.org/r/305095 (https://phabricator.wikimedia.org/T143138) [19:09:18] !log thcipriani@tin Started scap: testwiki to 1.29.0-wmf.1 and rebuild l10n cache [19:09:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:13:25] chasemp: papaul latest on labstore2001 - console com2 has stopped giving me the LVM activation message, but it's just blank screen [19:19:09] ^^ grrrit-wm1 is not me [19:19:26] madhuvishy: did you try and reboot labstore2001? [19:20:32] cmjohnson1: that's where I started this morning. It didn't come up - I went to management console. Network and disks seemed okay. I typed exit -> lvm activation loop.
Now I get blank screen on console com2, no prompt [19:21:39] most likely there is an error that can only be seen by plugging into going to reboot and check something [19:22:20] !log rebooting labvirt1004, labvirt1007 [19:22:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:22:56] cmjohnson1: ahh hmmm [19:27:38] papaul: could you take a look? ^ it also currently says Serial device in use so you might be already [19:28:40] i am in the serial right now [19:28:54] getting this msg [19:28:57] [FAILED] Failed to start LSB: NFS support files common to client and server. [19:28:57] See 'systemctl status nfs-common.service' for details [19:30:46] yeah that's because there was (errantly) nfs server installed there at some point by previous maintainers I think [19:30:51] it shouldn't be a show stopper tho [19:33:33] PROBLEM - Host labstore2001 is DOWN: PING CRITICAL - Packet loss = 100% [19:34:22] hmmm my scheduled downtime must be over [19:36:29] silenced icinga for 2001 [19:36:44] chasemp/madhuvishy: [ TIME ] Timed out waiting for device dev-mapper-os\x2dvar.device. [19:37:11] that's the LVM2 device /var is on under os vg I think [19:37:14] !log thcipriani@tin Finished scap: testwiki to 1.29.0-wmf.1 and rebuild l10n cache (duration: 27m 56s) [19:37:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:38:39] I think that may be your culprit https://p.defau.lt/?_l4Vxi7vCXqE_ED8pAxZiw [19:39:10] [DEPEND] Dependency failed for File System Check on /dev/mapper/os-var. 
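Note on the unit name in the timeout message above: systemd derives a device unit name from a path by dropping the leading "/", turning each remaining "/" into "-", and writing a literal "-" as "\x2d". That is why dev-mapper-os\x2dvar.device and /dev/mapper/os-var in the [DEPEND] line refer to the same thing. Below is a minimal Python sketch of the reverse mapping, assuming only those escaping rules; the function name systemd_unit_to_path is hypothetical and is not a tool anyone used in this log. On a systemd host, systemd-escape with --unescape and --path is the usual way to do this.

```python
import re

# Matches systemd's "\xNN" byte escapes (for example "\x2d" for "-").
_HEX_ESCAPE = re.compile(rb"\\x([0-9a-fA-F]{2})")


def systemd_unit_to_path(unit: str) -> str:
    """Decode a path-derived systemd unit name back into a filesystem path.

    Hypothetical helper for illustration; assumes the unit name was
    produced by systemd's path escaping.
    """
    # Drop the unit type suffix (".device", ".mount", ...).
    name = unit.rsplit(".", 1)[0]
    # The root path "/" is escaped as a lone "-".
    if name == "-":
        return "/"
    parts = []
    for component in name.split("-"):
        # Unescaped "-" separates path components; "\xNN" encodes one byte.
        raw = _HEX_ESCAPE.sub(
            lambda m: bytes([int(m.group(1), 16)]),
            component.encode("utf-8"),
        )
        parts.append(raw.decode("utf-8", errors="replace"))
    return "/" + "/".join(parts)


print(systemd_unit_to_path(r"dev-mapper-os\x2dvar.device"))  # /dev/mapper/os-var
```

Read this way, the timeout is on the logical volume "var" in volume group "os", which matches the guess in the channel that /var sits on LVM under the os VG.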
[19:41:56] (03PS1) 10Thcipriani: Group0 to php-1.29.0-wmf.1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/319124 [19:42:39] (03CR) 10Thcipriani: [C: 032] Group0 to php-1.29.0-wmf.1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/319124 (owner: 10Thcipriani) [19:43:07] (03Merged) 10jenkins-bot: Group0 to php-1.29.0-wmf.1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/319124 (owner: 10Thcipriani) [19:44:35] !log thcipriani@tin rebuilt wikiversions.php and synchronized wikiversions files: group0 to php-1.29.0-wmf.1 [19:44:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:48:29] cmjohnson1: does it prompt for root pass after that? google suggests 'lvm vgchange -ay' then 'systemctl default' may get us over the hump for the moment [19:48:38] yes it does [19:48:48] i left it at that prompt [19:48:59] chasemp: i'm in there [19:50:09] 06Operations, 10ops-codfw, 10DBA: Degraded RAID on db2052 - https://phabricator.wikimedia.org/T149377#2761081 (10Papaul) Dear Mr Papaul Tshibamba, Thank you for contacting Hewlett Packard Enterprise for your service request. This email confirms your request for service and the details are below. Your reque...
[19:51:17] chasemp: there's [19:51:19] https://www.irccloud.com/pastebin/QvhL9HcP/ [19:51:22] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 620 600 - REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 3075312 keys, up 1 days 11 hours - replication_delay is 620 [19:51:22] on fstab [19:51:46] madhuvishy: I would comment out var there if needed [19:52:15] but we may need to take the time to fix this semi-permanently [19:52:16] it says managed by puppet but i dont think these entries are [19:52:54] I don't think puppet purges undefines there so it's misleading [19:53:03] yup [19:55:54] (03CR) 10Muehlenhoff: [C: 032] Update to 4.4.29 (and the regression release 4.4.30) [debs/linux44] - 10https://gerrit.wikimedia.org/r/319096 (owner: 10Muehlenhoff) [19:56:01] 06Operations, 10ops-codfw, 10fundraising-tech-ops: payments2002 disk failure - https://phabricator.wikimedia.org/T149646#2761108 (10Papaul) Dear Mr Papaul Tshibamba, Thank you for contacting Hewlett Packard Enterprise for your service request. This email confirms your request for service and the details are... [19:58:22] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 3072321 keys, up 1 days 11 hours - replication_delay is 0 [20:06:29] (03PS3) 10Dzahn: add mapped IPv6 address for eventlog1001 [puppet] - 10https://gerrit.wikimedia.org/r/317192 [20:06:49] 06Operations, 10ops-codfw: Predictive disk failure on db2047 - https://phabricator.wikimedia.org/T149670#2761137 (10Papaul) Dear Mr Papaul Tshibamba, Thank you for contacting Hewlett Packard Enterprise for your service request.
This email confirms your request for service and the details are below. Your requ... [20:09:05] !log rebooting labvirt1006 [20:09:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:09:40] !log rebooting labvirt1008 [20:09:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:31:34] (03PS4) 10Ottomata: [WIP] evenstreams puppetization [puppet] - 10https://gerrit.wikimedia.org/r/317981 (https://phabricator.wikimedia.org/T148779) [20:31:35] (03CR) 10jenkins-bot: [V: 04-1] [WIP] evenstreams puppetization [puppet] - 10https://gerrit.wikimedia.org/r/317981 (https://phabricator.wikimedia.org/T148779) (owner: 10Ottomata) [20:31:35] (03PS1) 10Ottomata: Use Ruby yaml lib to render node::service deployment vars [puppet] - 10https://gerrit.wikimedia.org/r/319129 (https://phabricator.wikimedia.org/T148779) [20:32:33] (03CR) 10jenkins-bot: [V: 04-1] Use Ruby yaml lib to render node::service deployment vars [puppet] - 10https://gerrit.wikimedia.org/r/319129 (https://phabricator.wikimedia.org/T148779) (owner: 10Ottomata) [20:33:34] !log rebooting labvirt1005 and labvirt1009 [20:33:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:34:12] 06Operations, 10MediaWiki-Database, 07Performance: Use mysqli both in Zend and HHVM - https://phabricator.wikimedia.org/T149742#2761275 (10MaxSem) [21:17:06] Urbanecm: not sure, but here are announcements https://lists.wikimedia.org/pipermail/labs-announce/2016-November/thread.html
[21:18:02] mutante: I noticed https://lists.wikimedia.org/pipermail/labs-announce/2016-November/000176.html . 18:00 UTC is over and next scheduled outage is at November 14. [21:19:21] Urbanecm: can you ask that in -labs please [21:19:47] mutante: Yes. I think see -labs was that I should only listen there. [21:19:58] 06Operations, 10Gerrit, 10grrrit-wm, 13Patch-For-Review: Support restarting grrrit-wm automatically when we restart production gerrit - https://phabricator.wikimedia.org/T149609#2761484 (10Paladox) @Legoktm I've figured out how to do that. I found an npm library that can restart the whole script. [21:20:10] Urbanecm: no, i meant to ask and listen there :) [21:20:48] (03CR) 10BryanDavis: [C: 031] Use page.text() instead of deprecated page.edit() [debs/adminbot] - 10https://gerrit.wikimedia.org/r/319137 (https://phabricator.wikimedia.org/T124852) (owner: 10MtDu) [21:21:03] (03PS1) 10Filippo Giunchedi: mtail: introduce systemd unit [puppet] - 10https://gerrit.wikimedia.org/r/319138 (https://phabricator.wikimedia.org/T147923) [21:34:36] !log rebooting labvirt1012 [21:34:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:37:46] (03CR) 10Filippo Giunchedi: [C: 032] mtail: introduce systemd unit [puppet] - 10https://gerrit.wikimedia.org/r/319138 (https://phabricator.wikimedia.org/T147923) (owner: 10Filippo Giunchedi) [21:38:49] (03PS1) 10Dzahn: add AAAA and PTR for eventlog1001.eqiad.wmnet [dns] - 10https://gerrit.wikimedia.org/r/319150 [21:44:52] im an elf [21:44:58] !ops im an elf [21:45:14] !ops im an elf [21:45:52] ಠ_ಠ [21:46:13] werelizard: Can you add +t? [21:46:21] No, I don't have ops in this channel [21:46:44] Hm. Ah well. [21:55:23] 06Operations, 10EventBus, 10hardware-requests: eqiad/codfw: 1+1 Kafka broker in main clusters in eqiad and codfw - https://phabricator.wikimedia.org/T145082#2761628 (10RobH) The eqiad spare has been allocated and is handed off for use. The codfw host has been ordered on task T145112. 
It has an ETA of 2016-... [21:55:32] !log rebooting labvirt1013 [21:55:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:03:47] 06Operations, 10Monitoring, 13Patch-For-Review: Extract metrics from logs - https://phabricator.wikimedia.org/T147923#2761634 (10fgiunchedi) 05Open>03stalled This is now done and metrics from our syslog central servers are being pushed to graphite under `mtail` hierarchy. I'm stalling this since we're us... [22:06:34] (03PS1) 10Yuvipanda: quarry: Explicitly add python2 plugin [puppet] - 10https://gerrit.wikimedia.org/r/319231 [22:06:47] (03PS2) 10Dzahn: add mapped IPv6 address for contint1001 [puppet] - 10https://gerrit.wikimedia.org/r/316040 [22:06:52] (03PS2) 10Yuvipanda: quarry: Explicitly add python2 plugin [puppet] - 10https://gerrit.wikimedia.org/r/319231 [22:07:01] (03CR) 10Dzahn: [C: 032] add mapped IPv6 address for contint1001 [puppet] - 10https://gerrit.wikimedia.org/r/316040 (owner: 10Dzahn) [22:08:47] (03CR) 10Yuvipanda: [C: 032] quarry: Explicitly add python2 plugin [puppet] - 10https://gerrit.wikimedia.org/r/319231 (owner: 10Yuvipanda) [22:25:30] 06Operations, 06Performance-Team, 10Thumbor: Thumbor instances exit with exit code 0 even when crashing/failing - https://phabricator.wikimedia.org/T149560#2756079 (10fgiunchedi) @joe it doesn't look like we've updated firejail on thumbor machines so it could be the firejail bug indeed: ``` $ apt-cache poli... 
[22:25:40] RECOVERY - Host labstore2001 is UP: PING OK - Packet loss = 0%, RTA = 0.31 ms [22:30:00] PROBLEM - DPKG on thumbor1002 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [22:33:31] that's me ^ (thumbor) [22:35:00] RECOVERY - DPKG on thumbor1002 is OK: All packages OK [22:40:19] 06Operations, 06Performance-Team, 10Thumbor: Thumbor instances exit with exit code 0 even when crashing/failing - https://phabricator.wikimedia.org/T149560#2761689 (10fgiunchedi) also firejail 0.9.40-3 is supposed to fix this problem, according https://phabricator.wikimedia.org/T136957#2515944 so the unaffec... [22:55:52] (03PS1) 10Alex Monk: Remove extra non-ASCII character in role::cache::text that was causing issues [puppet] - 10https://gerrit.wikimedia.org/r/319243 [22:57:14] (03CR) 10Yuvipanda: [C: 032] Remove extra non-ASCII character in role::cache::text that was causing issues [puppet] - 10https://gerrit.wikimedia.org/r/319243 (owner: 10Alex Monk) [23:00:04] addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Dear anthropoid, the time has come. Please deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161101T2300). [23:23:11] (03PS3) 10Dzahn: add mapped IPv6 address for contint1001 [puppet] - 10https://gerrit.wikimedia.org/r/316040 [23:24:00] 06Operations, 10Gerrit, 10grrrit-wm, 13Patch-For-Review: Support restarting grrrit-wm automatically when we restart production gerrit - https://phabricator.wikimedia.org/T149609#2761798 (10Paladox) @Legoktm ok ive fixed everything https://gerrit.wikimedia.org/r/318976 now. It's ready to be merged but needs... 
[23:28:08] (03PS1) 10Alex Monk: deployment-prep: Fix deployment access.conf rules to allow all deployment servers [puppet] - 10https://gerrit.wikimedia.org/r/319249 [23:28:13] (03PS2) 10Yuvipanda: shinkengen: Ensure consistent ordering of hostgroups [puppet] - 10https://gerrit.wikimedia.org/r/317294 (owner: 10Alex Monk) [23:28:17] (03CR) 10Yuvipanda: [C: 032 V: 032] shinkengen: Ensure consistent ordering of hostgroups [puppet] - 10https://gerrit.wikimedia.org/r/317294 (owner: 10Alex Monk) [23:29:02] (03PS4) 10Dzahn: add mapped IPv6 address for contint1001 [puppet] - 10https://gerrit.wikimedia.org/r/316040 [23:29:38] (03CR) 10Yuvipanda: "Is this still needed, esp. with deployment-prep moving to puppetmaster::standalone?" [puppet] - 10https://gerrit.wikimedia.org/r/310729 (owner: 10Alex Monk) [23:30:19] (03CR) 10Yuvipanda: [C: 031] "+1. I'd want andrew around when merging tho" [puppet] - 10https://gerrit.wikimedia.org/r/307660 (owner: 10Alex Monk) [23:30:34] (03CR) 10Alex Monk: "I gave up with conftool stuff because it only seems useful with LVS right now, and we can't get LVS." [puppet] - 10https://gerrit.wikimedia.org/r/310729 (owner: 10Alex Monk) [23:31:35] (03PS2) 10Yuvipanda: deployment-prep: Fix deployment access.conf rules to allow all deployment servers [puppet] - 10https://gerrit.wikimedia.org/r/319249 (owner: 10Alex Monk) [23:31:43] (03CR) 10Yuvipanda: [C: 032 V: 032] deployment-prep: Fix deployment access.conf rules to allow all deployment servers [puppet] - 10https://gerrit.wikimedia.org/r/319249 (owner: 10Alex Monk) [23:35:33] let's see if it blends this time.. 
[23:41:59] grrrit-wm: restart [23:45:34] (03PS1) 10MaxSem: Switch discovery-stats cronjob to a dedicated script [puppet] - 10https://gerrit.wikimedia.org/r/319252 (https://phabricator.wikimedia.org/T149722) [23:49:38] (03PS1) 10EBernhardson: Add a wiki configuration tag for configured language [mediawiki-config] - 10https://gerrit.wikimedia.org/r/319253 (https://phabricator.wikimedia.org/T149755) [23:51:56] (03CR) 10EBernhardson: "randomly guessed at reviewers, i've added a few people that git blame says have previously looked at the nearby code :)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/319253 (https://phabricator.wikimedia.org/T149755) (owner: 10EBernhardson)