[00:03:16] RECOVERY - puppet last run on db1038 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures [00:08:06] RECOVERY - puppet last run on elastic1024 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [00:09:46] RECOVERY - puppet last run on db1020 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures [00:13:36] PROBLEM - puppet last run on ms-be1018 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [00:28:26] PROBLEM - puppet last run on ms-be1010 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [00:41:36] RECOVERY - puppet last run on ms-be1018 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures [00:45:36] PROBLEM - puppet last run on db1031 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [00:56:26] RECOVERY - puppet last run on ms-be1010 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures [01:13:36] RECOVERY - puppet last run on db1031 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [01:51:36] PROBLEM - puppet last run on seaborgium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [02:12:06] (03CR) 10Chad: [C: 04-1] "As I've mentioned before, there's no methodology behind how you picked 15 seconds. Why not 20? Why not 10000? Why not 2? Please provide a " [puppet] - 10https://gerrit.wikimedia.org/r/335714 (https://phabricator.wikimedia.org/T125357) (owner: 10Paladox) [02:12:59] 06Operations, 10Gerrit, 06Release-Engineering-Team: Investigate why cobalt went down for 1 minute on Feburary 5 and then again 4 minutes later - https://phabricator.wikimedia.org/T157203#2999987 (10demon) 05Open>03declined Meh, not worth investigating...seems as though it was transient. 
[02:14:36] (03CR) 10Tim Landscheidt: "I have installed https://integration.wikimedia.org/ci/job/debian-glue/608/artifact/toollabs-webservice_0.33%7Edev+0%7E20170205034157.608+t" [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/336049 (https://phabricator.wikimedia.org/T156626) (owner: 10Tim Landscheidt) [02:19:36] RECOVERY - puppet last run on seaborgium is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures [02:20:32] !log l10nupdate@tin scap sync-l10n completed (1.29.0-wmf.10) (duration: 07m 30s) [02:20:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:26:10] !log l10nupdate@tin ResourceLoader cache refresh completed at Mon Feb 6 02:25:51 UTC 2017 (duration 5m 20s) [02:26:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:35:43] (03CR) 10Tim Landscheidt: "Puppet wanted to downgrade `tools-webservice`, so I put the dev version in `aptly` and ran Puppet on all hosts." [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/336049 (https://phabricator.wikimedia.org/T156626) (owner: 10Tim Landscheidt) [03:22:46] PROBLEM - puppet last run on db1055 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [03:50:46] RECOVERY - puppet last run on db1055 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [03:53:46] PROBLEM - All k8s worker nodes are healthy on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/k8s/nodes/ready - 185 bytes in 0.200 second response time [04:15:16] PROBLEM - mailman I/O stats on fermium is CRITICAL: CRITICAL - I/O stats: Transfers/Sec=5063.20 Read Requests/Sec=4057.90 Write Requests/Sec=4.80 KBytes Read/Sec=34480.80 KBytes_Written/Sec=22.00 [04:20:46] RECOVERY - All k8s worker nodes are healthy on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.357 second response time [04:23:16] RECOVERY - mailman I/O stats on fermium is OK: OK - I/O stats: Transfers/Sec=82.80 Read Requests/Sec=156.90 Write Requests/Sec=1.80 KBytes Read/Sec=8694.40 KBytes_Written/Sec=34.80 [04:47:56] PROBLEM - Postgres Replication Lag on maps2003 is CRITICAL: CRITICAL - Rep Delay is: 1800.518122 Seconds [04:48:56] RECOVERY - Postgres Replication Lag on maps2003 is OK: OK - Rep Delay is: 0.0 Seconds [05:37:44] 06Operations: Expire time on 404 is too high (Wikipedia) - https://phabricator.wikimedia.org/T157214#3000284 (10Peachey88) [05:40:36] (03PS1) 10Ladsgroup: ores: increase capacity [puppet] - 10https://gerrit.wikimedia.org/r/336176 (https://phabricator.wikimedia.org/T157206) [05:46:30] ops around to get this reviewed ^ [05:46:34] ? [05:48:16] PROBLEM - puppet last run on labvirt1009 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [06:17:16] RECOVERY - puppet last run on labvirt1009 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures [06:26:58] 06Operations: Expire time on 404 is too high (Wikipedia) - https://phabricator.wikimedia.org/T157214#2999306 (10MZMcBride) Can you please specify which HTTP header(s) you're referring to and provide an example URL? [06:27:56] PROBLEM - carbon-cache@b service on graphite2002 is CRITICAL: CRITICAL - Expecting active but unit carbon-cache@b is failed [06:28:06] PROBLEM - Check systemd state on graphite2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [06:29:46] PROBLEM - Check HHVM threads for leakage on mw1169 is CRITICAL: CRITICAL: HHVM has more than double threads running or queued than apache has busy workers [06:33:46] PROBLEM - puppet last run on contint1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [06:45:56] RECOVERY - carbon-cache@b service on graphite2002 is OK: OK - carbon-cache@b is active [06:46:06] RECOVERY - Check systemd state on graphite2002 is OK: OK - running: The system is fully operational [06:48:06] PROBLEM - puppet last run on db1021 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [06:50:46] RECOVERY - Disk space on labtestnet2001 is OK: DISK OK [06:57:46] PROBLEM - Check HHVM threads for leakage on mw1168 is CRITICAL: CRITICAL: HHVM has more than double threads running or queued than apache has busy workers [06:58:34] (03CR) 10Marostegui: [C: 032] db-codfw.php: Depool db2060 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/335980 (https://phabricator.wikimedia.org/T156161) (owner: 10Marostegui) [07:00:02] (03Merged) 10jenkins-bot: db-codfw.php: Depool db2060 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/335980 (https://phabricator.wikimedia.org/T156161) (owner: 10Marostegui) [07:00:12] (03CR) 10jenkins-bot: db-codfw.php: Depool db2060 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/335980 (https://phabricator.wikimedia.org/T156161) (owner: 10Marostegui) [07:01:25] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Depool db2060 - T156161 (duration: 00m 40s) [07:01:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:01:30] T156161: db2060 not accessible - https://phabricator.wikimedia.org/T156161 [07:01:46] RECOVERY - puppet last run on contint1001 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures [07:03:25] !log Upgrade mariadb+packages db1039 - T153300 [07:03:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:03:29] T153300: Remove partitions from metawiki.pagelinks in s7 - https://phabricator.wikimedia.org/T153300 [07:04:25] marostegui: Hey, if it's okay can you check this? https://gerrit.wikimedia.org/r/336176 (https://phabricator.wikimedia.org/T157206) [07:04:35] It's for an UBN task [07:04:41] https://grafana.wikimedia.org/dashboard/db/ores?panelId=9&fullscreen&from=now-6h&to=now [07:05:01] Amir1: checking [07:05:05] Thanks! 
[07:05:55] Amir1: the change looks good to me, but I am completely lost in context if that is a good or bad change :) [07:06:50] marostegui: It increases ores workers capacity, we did this several times when it was under pressure. Let me grab previous examples [07:07:35] https://gerrit.wikimedia.org/r/336048 [07:07:45] https://gerrit.wikimedia.org/r/316271 [07:08:24] marostegui: the biggest problem in scb nodes is their memory (at least for scb1001 and scb1002) but I'm monitoring them and they are okay [07:08:29] https://ganglia.wikimedia.org/latest/graph.php?r=1hr&z=xlarge&h=scb1001.eqiad.wmnet&m=cpu_report&s=by+name&mc=2&g=mem_report&c=Service+Cluster+B+eqiad [07:09:09] so the increase of workers will increase the memory but they do have a ton free [07:09:38] yup [07:10:12] It increases CPU usage too but there's nothing to worry about https://grafana.wikimedia.org/dashboard/db/ores [07:13:07] <_joe_> Amir1: shouldn't we just ban that chinese bot? [07:13:56] <_joe_> in the mediawiki extension directly so that we don't need to do things with ores? [07:13:58] 06Operations, 10ORES, 06Revision-Scoring-As-A-Service, 10Revision-Scoring-As-A-Service-Backlog, and 2 others: ORES Overloaded (particularly 02/05/17 2:25-2:30) - https://phabricator.wikimedia.org/T157206#3000449 (10greg) {F5506544} Whatever it was hasn't subsided yet. 
[07:14:00] _joe_: hey, the problem is that 1- I'm not totally sure it's the chinese bot 2- They don't hit ores, they hit mediawiki and make mw nodes request to ores using api.php [07:14:27] <_joe_> yes, I know, that's why I was suggesting to do it in mediawiki [07:14:38] <_joe_> I'm not ok with raising workers again [07:14:50] <_joe_> so give me 5 mins and I'll investigate [07:14:56] _joe_: okay [07:15:10] thanks _joe_ [07:16:06] RECOVERY - puppet last run on db1021 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [07:22:09] (03PS2) 10Marostegui: mariadb: Use the common gtid_domain_id [puppet] - 10https://gerrit.wikimedia.org/r/335817 (https://phabricator.wikimedia.org/T149418) [07:23:31] (03CR) 10Marostegui: [C: 032] mariadb: Use the common gtid_domain_id [puppet] - 10https://gerrit.wikimedia.org/r/335817 (https://phabricator.wikimedia.org/T149418) (owner: 10Marostegui) [07:24:17] 06Operations, 10ORES, 06Revision-Scoring-As-A-Service, 10Revision-Scoring-As-A-Service-Backlog, and 2 others: ORES Overloaded (particularly 02/05/17 2:25-2:30) - https://phabricator.wikimedia.org/T157206#2999048 (10Joe) Before raising the number of workers for ORES: - has anyone done an analysis of where... [07:24:24] <_joe_> Amir1: see my response [07:25:26] <_joe_> Amir1: we can't keep increasing the number of ORES workers indefinitely [07:26:29] <_joe_> Amir1: so it seems that most of the requests come from ChangePropagation [07:26:34] <_joe_> why? [07:26:37] _joe_: Yes, I understand. I was thinking of it as a temp. solution until we can find out who is doing these requests since the stress is now disrupting the ores review tool in Wikipedia [07:26:59] <_joe_> Amir1: let's just turn off changeprop calling it [07:27:03] <_joe_> problem solved [07:27:10] _joe_: that's precaching. 
Meaning someone is editing bot-ly but not using a bot account [07:27:36] <_joe_> ok but if precaching accounts for 99% of your requests [07:27:43] <_joe_> you're doing something very, very wrong [07:27:59] <_joe_> it's this whole damaged paradigm we keep stumbling upon [07:28:01] let me check if it's true for a certain wiki [07:28:16] PROBLEM - Start a job and verify on Precise on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/grid/start/precise - 185 bytes in 0.375 second response time [07:28:18] <_joe_> I'm looking at main.log on the scb cluster [07:28:30] <_joe_> and it's 99% ok requests coming from changeprop [07:28:35] <_joe_> let me be precise [07:28:38] _joe_: no, it's also ores extension requesting once the edit is made so we respond from cache later on [07:29:07] turning precaching off increases response time by a multiple of four, for humans [07:30:03] wow [07:30:10] (03PS4) 10Elukey: Add aqs1009-a to the AQS Cassandra cluster [puppet] - 10https://gerrit.wikimedia.org/r/335993 (https://phabricator.wikimedia.org/T155654) [07:30:16] RECOVERY - Start a job and verify on Precise on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.631 second response time [07:30:36] because scoring an edit using AI is time-consuming [07:30:53] !log Stop MySQL on db1095 to snapshot it to es1017 - T153743 [07:30:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:30:57] T153743: Add and sanitize s2, s4, s5, s6 and s7 to sanitarium2 and new labsdb hosts - https://phabricator.wikimedia.org/T153743 [07:31:16] <_joe_> yeah but that would solve your overload problem, given actually overall 70% of requests are for precaching (I re-checked on a longer interval) [07:31:45] <_joe_> Amir1: and if we make the loading of the AI-generated score async, it can even be a not horrible UX for users [07:32:04] <_joe_> precaching is the 
lamest thing one can do, and we do it extensively in all of our services stratum [07:32:41] It's async already [07:34:16] <_joe_> Amir1: do we have any statistics on cache hits for ores? [07:34:26] _joe_: yup [07:34:27] https://grafana.wikimedia.org/dashboard/db/ores [07:35:19] <_joe_> Amir1: if I read this correctly [07:35:28] _joe_: precaching is 360 per min. it can't be the reason for the stress, we can handle much much more [07:35:40] <_joe_> Amir1: the logs beg to differ [07:36:05] (03CR) 10Elukey: [C: 032] Add aqs1009-a to the AQS Cassandra cluster [puppet] - 10https://gerrit.wikimedia.org/r/335993 (https://phabricator.wikimedia.org/T155654) (owner: 10Elukey) [07:36:29] https://grafana.wikimedia.org/dashboard/db/ores?panelId=1&fullscreen&from=now-7d&to=now [07:36:49] <_joe_> yeah again, the logs tell a different story [07:37:13] <_joe_> main.log on scb1003 has 15.2 K lines from changeprop out of 24K total [07:37:30] <_joe_> either we don't log every request, or your instrumentation is wrong [07:38:36] !log bootstrapping aqs1009-a (new AQS cassandra instance) [07:38:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:39:44] <_joe_> Amir1: anyways, I'd turn off precaching at this point [07:40:06] PROBLEM - check_mysql on frdb2001 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 3531 [07:40:36] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "I would rather turn precaching off temporarily" [puppet] - 10https://gerrit.wikimedia.org/r/336176 (https://phabricator.wikimedia.org/T157206) (owner: 10Ladsgroup) [07:41:46] PROBLEM - puppet last run on aqs1009 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
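The by-hand tally _joe_ describes (15.2 K changeprop lines out of 24 K in main.log on scb1003) can be scripted. Below is a minimal sketch, assuming a hypothetical JSON-lines log format with a `user_agent` field — the real main.log field names may differ, and the sample entries are fabricated, not real log data:

```python
import json
from collections import Counter

def source_breakdown(log_lines):
    """Count log entries per requesting agent.

    Assumes each line is a JSON object carrying a 'user_agent' field;
    the actual main.log schema may use different field names.
    """
    counts = Counter()
    for line in log_lines:
        try:
            entry = json.loads(line)
        except ValueError:
            continue  # skip malformed lines
        counts[entry.get("user_agent", "unknown")] += 1
    return counts

# Fabricated example entries, purely illustrative:
sample = [
    '{"user_agent": "ChangePropagation/WMF", "uri": "/v2/scores/enwiki"}',
    '{"user_agent": "ChangePropagation/WMF", "uri": "/v2/scores/wikidatawiki"}',
    '{"user_agent": "other-client", "uri": "/v2/scores/enwiki"}',
]
counts = source_breakdown(sample)
total = sum(counts.values())
cp_share = counts["ChangePropagation/WMF"] / total
```

Running the same breakdown over a longer interval is what moved the estimate from "99%" to "overall 70%" in the discussion above.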
[07:43:10] <_joe_> Amir1: ok I think I understood what you're reading wrong [07:43:33] <_joe_> "Score processed (not cached)" refers to the number of scores ORES IS processing [07:43:39] <_joe_> independently from the source [07:43:51] <_joe_> not the pre-caching requests [07:44:06] PROBLEM - Check systemd state on aqs1009 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [07:45:10] 06Operations, 10ORES, 06Revision-Scoring-As-A-Service, 10Revision-Scoring-As-A-Service-Backlog, and 2 others: ORES Overloaded (particularly 02/05/17 2:25-2:30) - https://phabricator.wikimedia.org/T157206#3000507 (10Joe) So after taking a quick look at ORES's logs: around 70% of requests come from changepro... [07:45:46] RECOVERY - puppet last run on aqs1009 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures [07:46:06] RECOVERY - Check systemd state on aqs1009 is OK: OK - running: The system is fully operational [07:47:18] _joe_: I'm trying to see if CP load has changed substantially in the past couple of days: https://grafana.wikimedia.org/dashboard/db/eventbus?panelId=8&fullscreen&from=now-7d&to=now [07:47:38] <_joe_> it has [07:47:47] <_joe_> or it hasn't [07:47:56] <_joe_> it's still the lion's share of requests [07:48:32] _joe_: I think I can reduce it without hurting anything [07:48:41] but it might take some time [07:50:07] RECOVERY - check_mysql on frdb2001 is OK: Uptime: 572452 Threads: 1 Questions: 7913409 Slow queries: 2879 Opens: 4735 Flush tables: 1 Open tables: 577 Queries per second avg: 13.823 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 0 [07:50:35] <_joe_> Amir1: well we're not in an emergency, but I think I can just turn off the rule in CP [07:51:16] PROBLEM - Start a job and verify on Precise on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/grid/start/precise - 185 bytes in 0.227 second 
response time [07:52:53] <_joe_> let me hand-patch it on one server [07:53:14] _joe_: If it's okay, wait until I make a patch in puppet [07:53:16] RECOVERY - Start a job and verify on Precise on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.545 second response time [07:53:36] for disabling CP in several wikis (instead of turning it off in all wikis) [07:53:45] I'll explain why [07:54:00] <_joe_> is it possible to control it via puppet? [07:54:04] <_joe_> I wasn't aware :) [07:54:38] Yeah, but I think it needs some restart or something like that, services know better. [07:54:41] mobrovac Pchelolo [07:55:01] <_joe_> I know how to handle that [07:55:12] great [07:55:16] <_joe_> but the conf you're referring to is in the CP deploy repo [07:55:19] <_joe_> I think [07:56:30] there are duplicates (at least there were) I made some patches [07:56:44] the main one is in the puppet [07:57:22] _joe_: an example [07:57:23] https://gerrit.wikimedia.org/r/#/c/304125/ [07:57:51] (and you merged it :P) [07:57:52] <_joe_> Amir1: yeah, not anymore [07:57:57] <_joe_> I'm telling you ;) [07:58:10] I wasn't aware of that [07:58:46] Hey [07:59:00] I'm still driving home and I don't have my laptop [07:59:11] it's okay [07:59:13] <_joe_> Pchelolo: I'm not sure you're needed [07:59:13] The confit is in the deploy repo for cp [07:59:23] <_joe_> drive safely [07:59:43] I'm a passenger, so it's fine [07:59:48] <_joe_> aha oki [08:01:12] <_joe_> Pchelolo: so, ores is overloaded atm and I'm not comfortable raising the number of workers yet again, and I want to understand the source of the overload; I am going to turn off precaching via CP for now to figure out what's going on once that is out (it will also greatly reduce the load anyways). Anything I should look out for when restarting CP? [08:01:26] <_joe_> Pchelolo: apart from restarting them one at a time at some interval? 
[08:04:12] Joe just merge a configured chabge in deploy repo and do a normal scrap deploy [08:04:26] No other steps nesessry [08:05:06] Confit change. Typing on a phone is hard.. [08:05:35] <_joe_> Pchelolo: I got you, thanks :) [08:05:52] <_joe_> Amir1: should I do it or are you working on it? [08:06:01] _joe_: I'm on it [08:06:01] Kk, I'll be home in about 40 minutes, can take a look too [08:06:16] <_joe_> no need Pchelolo, it's late at night, just rest [08:06:23] _joe_: If it's okay I only disable it for eight wikis (wikidata and enwiki included) [08:06:32] <_joe_> Amir1: that seems about right [08:06:35] Cool, thank you ;) see you tomorrow [08:06:41] Pchelolo: see you [08:09:03] (03PS1) 10Elukey: Add aqs100[89] to the AQS conftool data [puppet] - 10https://gerrit.wikimedia.org/r/336192 (https://phabricator.wikimedia.org/T155654) [08:12:59] (03CR) 10Elukey: [C: 032] Add aqs100[89] to the AQS conftool data [puppet] - 10https://gerrit.wikimedia.org/r/336192 (https://phabricator.wikimedia.org/T155654) (owner: 10Elukey) [08:15:07] <_joe_> Amir1: when you have the patch ready, let me know [08:15:43] _joe_: It just did https://github.com/wikimedia/change-propagation/pull/161 [08:15:46] (03CR) 10Giuseppe Lavagetto: [C: 032] Generalize entities definitions [software/conftool] - 10https://gerrit.wikimedia.org/r/288609 (https://phabricator.wikimedia.org/T155823) (owner: 10Giuseppe Lavagetto) [08:16:15] <_joe_> Amir1: oh it's on GH? [08:16:18] <_joe_> the config too? [08:16:26] <_joe_> What. The. Fuck. [08:16:43] (03Merged) 10jenkins-bot: Generalize entities definitions [software/conftool] - 10https://gerrit.wikimedia.org/r/288609 (https://phabricator.wikimedia.org/T155823) (owner: 10Giuseppe Lavagetto) [08:17:36] _joe_: AFAIK it's in github. 
[08:17:52] <_joe_> Amir1: yes you are correct, and my comment wasn't directed to you [08:18:27] you pinged me and it had a question mark :P [08:18:28] <_joe_> I'm waiting for Github's CI to pass and I'll merge your PR [08:18:35] Thanks [08:19:05] !log elukey@puppetmaster1001 conftool action : set/pooled=yes; selector: name=aqs1008.eqiad.wmnet [08:19:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:19:09] FSM, nice! [08:20:20] (03PS1) 10Muehlenhoff: Add musikanimal to analytics-privatadata-users [puppet] - 10https://gerrit.wikimedia.org/r/336193 (https://phabricator.wikimedia.org/T156986) [08:20:26] (03PS1) 10Marostegui: db-eqiad.php: Depool db1064 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/336194 (https://phabricator.wikimedia.org/T153743) [08:20:40] 06Operations, 10ORES, 06Revision-Scoring-As-A-Service, 10Revision-Scoring-As-A-Service-Backlog, and 2 others: ORES Overloaded (particularly 02/05/17 2:25-2:30) - https://phabricator.wikimedia.org/T157206#3000706 (10Ladsgroup) https://github.com/wikimedia/change-propagation/pull/161 will reduce the load by d... [08:21:32] _joe_: Tests depend on enwiki: https://travis-ci.org/wikimedia/change-propagation/jobs/198760004 [08:21:38] Let me fix that [08:24:15] <_joe_> jesus [08:24:31] <_joe_> tests depending on live config [08:24:39] <_joe_> /o\ [08:26:02] (03PS2) 10Muehlenhoff: Add musikanimal to analytics-privatadata-users [puppet] - 10https://gerrit.wikimedia.org/r/336193 (https://phabricator.wikimedia.org/T156986) [08:26:14] :)))) [08:30:23] (03CR) 10Muehlenhoff: [C: 032] Add musikanimal to analytics-privatadata-users [puppet] - 10https://gerrit.wikimedia.org/r/336193 (https://phabricator.wikimedia.org/T156986) (owner: 10Muehlenhoff) [08:31:27] <_joe_> Amir1: still failing [08:31:56] Yeah, strange, let me check. I'll come back to you once I understand what's wrong [08:36:52] PROBLEM - puppet last run on mw1176 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [06:17:16] RECOVERY - puppet last run on labvirt1009 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures [08:38:46] _joe_: I just realized what's wrong. Even though ores supports dewiki, precaching is not enabled on German Wikipedia and no one noticed that [08:38:49] ... [08:38:56] <_joe_> lol [08:39:35] <_joe_> so let's use a small wiki (or a mock wiki and from a mock config, but I digress) [08:40:41] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1064 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/336194 (https://phabricator.wikimedia.org/T153743) (owner: 10Marostegui) [08:41:07] 06Operations, 10Ops-Access-Requests, 10Analytics, 10Analytics-Cluster, 13Patch-For-Review: Requesting access to analytics-privatedata-users for musikanimal - https://phabricator.wikimedia.org/T156986#3000766 (10MoritzMuehlenhoff) 05Open>03Resolved a:03MoritzMuehlenhoff @MusikAnimal I've enabled you... [08:41:31] <_joe_> Amir1: :))) [08:41:33] <_joe_> thanks [08:41:52] tools.wmflabs.org/dexbot/tools/wikilabels_stats.php [08:42:09] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1064 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/336194 (https://phabricator.wikimedia.org/T153743) (owner: 10Marostegui) [08:42:17] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1064 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/336194 (https://phabricator.wikimedia.org/T153743) (owner: 10Marostegui) [08:42:24] I used wikis that are very much behind in labeling (so won't have the damaging model anytime soon) [08:42:51] I wanted to go with azwiki but it wasn't there either [08:43:28] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1064 - T153743 (duration: 00m 41s) [08:43:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:43:32] T153743: Add and sanitize s2, s4, s5, s6 and s7 to sanitarium2 and new labsdb hosts - https://phabricator.wikimedia.org/T153743 [08:50:01] (03Abandoned) 10Ladsgroup: ores: increase capacity [puppet] - 10https://gerrit.wikimedia.org/r/336176 
(https://phabricator.wikimedia.org/T157206) (owner: 10Ladsgroup) [08:52:44] 06Operations, 10ORES, 06Revision-Scoring-As-A-Service, 10Revision-Scoring-As-A-Service-Backlog, and 2 others: ORES Overloaded (particularly 02/05/17 2:25-2:30) - https://phabricator.wikimedia.org/T157206#3000780 (10Ladsgroup) I'm pretty sure someone externally is putting pressure on the service too. This correl... [08:54:38] <_joe_> Amir1: I am trying to understand how the whole changeprop deploy repo thing works [08:54:41] <_joe_> it's not easy tbh [08:55:54] <_joe_> Amir1: btw you changed the example config [08:55:56] _joe_: Thanks. I hope there is a doc somewhere [08:55:58] <_joe_> not the actual one [08:56:30] _joe_: Where is the actual one? I searched for ores and that was the only place [08:56:35] <_joe_> uhm [08:57:07] It seems the example is misleading because in my previous PRs I changed that file only [08:57:15] <_joe_> Amir1: yes, yes [09:01:39] (03CR) 10Hashar: [C: 032] Correct weekday in changelog entry [software/tools-manifest] - 10https://gerrit.wikimedia.org/r/336055 (https://phabricator.wikimedia.org/T156651) (owner: 10Tim Landscheidt) [09:02:16] (03Merged) 10jenkins-bot: Correct weekday in changelog entry [software/tools-manifest] - 10https://gerrit.wikimedia.org/r/336055 (https://phabricator.wikimedia.org/T156651) (owner: 10Tim Landscheidt) [09:02:20] (03CR) 10Hashar: [C: 031] Add extended description to control [software/tools-manifest] - 10https://gerrit.wikimedia.org/r/336056 (https://phabricator.wikimedia.org/T156651) (owner: 10Tim Landscheidt) [09:04:52] RECOVERY - puppet last run on mw1176 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures [09:06:22] <_joe_> Amir1: it's taking me so long because there are unreleased changes on changeprop already [09:06:26] <_joe_> and I have to weed them out [09:07:17] 06Operations, 10ops-eqiad: Heating alerts for mw servers in eqiad - https://phabricator.wikimedia.org/T149287#2747695 (10elukey) @Cmjohnson 
ping :) Can we apply some thermal paste to one host as test? [09:07:23] _joe_: oh, shoot. Cherry-pick might help [09:08:13] 06Operations: Package the next LTS kernel (4.9) - https://phabricator.wikimedia.org/T154934#3000815 (10ema) p:05Triage>03Normal [09:08:13] <_joe_> Amir1: that's not so easy [09:08:20] <_joe_> I have to rebase/rearrange changes [09:09:04] Yeah [09:09:11] 06Operations, 10ops-eqiad: mw1236 powered down and not able to powerup - https://phabricator.wikimedia.org/T156610#3000816 (10elukey) Extended the downtime to prevent spurious notifications in IRC. [09:09:30] <_joe_> Amir1: actually, you should've made your change here [09:09:51] <_joe_> scap/templates/config.yaml.j2 [09:09:55] <_joe_> in the deploy repo [09:10:25] <_joe_> so I am going to propagate your changes there [09:10:30] _joe_: I can do it now, where is it though? [09:10:30] <_joe_> (pun intended!) [09:12:06] <_joe_> Amir1: doing it myself [09:12:08] _joe_: Thanks [09:15:29] <_joe_> Amir1: https://gerrit.wikimedia.org/r/#/c/336197/ [09:17:12] (03PS1) 10Marostegui: db-eqiad.php: Depool db1028 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/336198 (https://phabricator.wikimedia.org/T153300) [09:19:33] <_joe_> Amir1: can you double-check it? 
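For readers following along: the file _joe_ points at (`scap/templates/config.yaml.j2` in the changeprop deploy repo) is where the rule definitions live. A purely illustrative sketch of what restricting the ores precache rule to selected wikis might look like — the rule name, field names, and domain list here are assumptions, not the actual template contents:

```yaml
# Hypothetical changeprop rule sketch: ORES precaching limited to a
# whitelist of wiki domains. Illustrative only; not the real template.
ores_precache:
  topic: mediawiki.revision-create
  match:
    meta:
      domain: /^(de|az)\.wikipedia\.org$/   # wikis still precached
  exec:
    method: post
    uri: https://ores.wikimedia.org/v3/precache
```

Per Pchelolo earlier in the log, a change like this is shipped by merging it in the deploy repo and doing a normal scap deploy, with no other steps needed.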
[09:19:48] _joe_: of course [09:19:52] I've got distracted [09:20:00] <_joe_> np :) [09:20:44] _joe_: Wikidata is missing [09:21:06] <_joe_> Amir1: hah, right, amending [09:21:13] thanks [09:22:47] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1028 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/336198 (https://phabricator.wikimedia.org/T153300) (owner: 10Marostegui) [09:24:40] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1028 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/336198 (https://phabricator.wikimedia.org/T153300) (owner: 10Marostegui) [09:24:57] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1028 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/336198 (https://phabricator.wikimedia.org/T153300) (owner: 10Marostegui) [09:25:50] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1028 - T153300 (duration: 00m 42s) [09:25:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:25:54] T153300: Remove partitions from metawiki.pagelinks in s7 - https://phabricator.wikimedia.org/T153300 [09:26:12] Dereckson: how can I get a pageid to delete a bad page title? 
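Dereckson's pageid lookup can be done with a single `api.php` query; the ApiSandbox link that follows shows it interactively. A minimal sketch of building the same request URL — the example title is taken from that link, and the helper name is just for illustration:

```python
from urllib.parse import urlencode

def pageid_query_url(title, host="es.wikipedia.org"):
    """Build an api.php URL that returns the pageids for a given title.

    With 'indexpageids', the JSON response carries query.pageids listing
    the numeric page ids, which can then be used to act on the page.
    """
    params = {
        "action": "query",
        "format": "json",
        "indexpageids": 1,
        "titles": title,
    }
    return f"https://{host}/w/api.php?" + urlencode(params)

url = pageid_query_url("C : The Contra Adventure")
```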
[09:26:24] https://es.wikipedia.org/wiki/Especial:ApiSandbox#action=query&format=json&servedby=1&curtimestamp=1&responselanginfo=1&prop=linkshere&indexpageids=1&titles=C+%3A+The+Contra+Adventure&utf8=1 [09:26:59] !log Deploy ALTER table db1028 metawiki.pagelinks - T153300 [09:27:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:30:26] !log Stop MySQL Replication on db1064 for maintenance - T153743 [09:30:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:30:30] T153743: Add and sanitize s2, s4, s5, s6 and s7 to sanitarium2 and new labsdb hosts - https://phabricator.wikimedia.org/T153743 [09:36:18] 06Operations, 10ops-codfw, 13Patch-For-Review, 15User-Elukey: codfw:rack/setup mc2019-mc2036 - https://phabricator.wikimedia.org/T155755#3000898 (10elukey) Back to the original task :) So from T111575, it seems that there is a valid reason to have both eqiad and codfw listed in all the nutcracker's config... [09:36:35] !log Removed 2fa from an account, per T157191 [09:36:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:38:22] PROBLEM - Apache HTTP on mwdebug1002 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50376 bytes in 0.003 second response time [09:38:22] PROBLEM - Nginx local proxy to apache on mwdebug1002 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50376 bytes in 0.014 second response time [09:39:32] PROBLEM - HHVM processes on mwdebug1002 is CRITICAL: PROCS CRITICAL: 0 processes with command name hhvm [09:39:42] PROBLEM - DPKG on mwdebug1002 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [09:39:42] PROBLEM - HHVM rendering on mwdebug1002 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50376 bytes in 0.005 second response time [09:39:46] moritzm: ---^ working on it? [09:39:54] (03CR) 10Hashar: "Looks good. 
A nit is reusing a makefile provided by dpkg-dev that handles version parsing for you :-} Man page looks good!" (033 comments) [software/tools-manifest] - 10https://gerrit.wikimedia.org/r/336057 (https://phabricator.wikimedia.org/T156651) (owner: 10Tim Landscheidt) [09:40:07] !log elasticsearch - reindexing from 2017-02-04T20:00:00Z to 2017-02-05T23:59:00Z - T139043 [09:40:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:40:11] T139043: nested RemoteTransportExceptions filled the disk on elastic1036 and elastic1045 during a rolling restart - https://phabricator.wikimedia.org/T139043 [09:40:22] PROBLEM - Router interfaces on cr2-knams is CRITICAL: CRITICAL: host 91.198.174.246, interfaces up: 52, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-1/0/0: down - Core: asw-esams:xe-3/0/42 (GBLX leg 2) {#14007} [10Gbps DF CWDM C49]BR [09:41:32] RECOVERY - HHVM processes on mwdebug1002 is OK: PROCS OK: 6 processes with command name hhvm [09:41:42] RECOVERY - DPKG on mwdebug1002 is OK: All packages OK [09:41:42] RECOVERY - HHVM rendering on mwdebug1002 is OK: HTTP OK: HTTP/1.1 200 OK - 79576 bytes in 1.598 second response time [09:42:22] RECOVERY - Apache HTTP on mwdebug1002 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.074 second response time [09:42:22] RECOVERY - Nginx local proxy to apache on mwdebug1002 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 619 bytes in 0.137 second response time [09:42:32] elukey: yeah, that's me [09:42:51] okok just wanted to double check, thanks :) [09:44:32] I've depooled it, need it for some more tests [09:54:52] RECOVERY - Check HHVM threads for leakage on mw1169 is OK: OK [09:55:50] <_joe_> Amir1: I am going to deploy the change now [09:56:14] _joe_: Thanks [09:56:21] <_joe_> Let [09:56:27] <_joe_> s see what that does [09:58:42] PROBLEM - puppet last run on einsteinium is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [10:14:15] !log cache_maps: upgrade to jessie 8.7 and reboot into kernel 4.4.2-3+wmf8 T155401 [10:14:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:14:19] T155401: Integrate jessie 8.7 point release - https://phabricator.wikimedia.org/T155401 [10:14:41] !log Started to transfer commonswiki (ibd and cfg) from db1064 to labsdb1011 - T153743 [10:14:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:14:45] T153743: Add and sanitize s2, s4, s5, s6 and s7 to sanitarium2 and new labsdb hosts - https://phabricator.wikimedia.org/T153743 [10:17:16] !log data import complete for wdqs1003, repooling - T152643 [10:17:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:17:21] T152643: rack/setup/install wdqs1003 - https://phabricator.wikimedia.org/T152643 [10:17:40] <_joe_> Amir1: do you remember how I can deploy to canaries with scap3? [10:18:15] !log oblivian@tin Started deploy [changeprop/deploy@ac11ebe]: Deploying ores concurrency/disabling to canary [10:18:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:18:32] _joe_: ideally it should be "scap deploy" and it deploys the first part [10:18:33] <_joe_> I am doing it with -l [10:18:45] once it's done it should ask for all nodes [10:18:59] ores is like this [10:19:15] in mw nodes, we need to pull it in the canary node directly [10:19:39] <_joe_> ok [10:19:49] 06Operations, 10ops-eqiad, 06Discovery, 10Wikidata, and 2 others: rack/setup/install wdqs1003 - https://phabricator.wikimedia.org/T152643#3000988 (10Gehel) 05Open>03Resolved [10:19:53] 06Operations, 10ops-codfw, 06Discovery, 10Wikidata, and 2 others: Adjust balance of WDQS nodes to allow continued operation if eqiad went offline.
- https://phabricator.wikimedia.org/T124627#3000991 (10Gehel) [10:20:36] !log oblivian@tin Started deploy [changeprop/deploy@ac11ebe]: Deploying ores concurrency/disabling [10:20:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:21:14] !log oblivian@tin Finished deploy [changeprop/deploy@ac11ebe]: Deploying ores concurrency/disabling (duration: 00m 38s) [10:21:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:25:01] <_joe_> Amir1: how hard would it be for you to do what I asked in the ticket? [10:26:19] James_F: available? [10:26:52] PROBLEM - IPsec on cp1047 is CRITICAL: Strongswan CRITICAL - ok: 22 not-conn: cp2021_v4, cp2021_v6 [10:27:23] IPsec spam is me ^ [10:27:31] <_joe_> I was about to ask [10:27:46] _joe_: what thing? [10:27:58] <_joe_> https://phabricator.wikimedia.org/T157206#3000507 [10:28:52] RECOVERY - IPsec on cp1047 is OK: Strongswan OK - 24 ESP OK [10:31:13] _joe_: The second part is not super easy but we can simply use hadoop to see what's making api.php requests too. There are some ways. Let me find it. Adam made those queries [10:31:31] wrt the third part.
I think we already have a patch in operation/config [10:31:42] PROBLEM - IPsec on cp3004 is CRITICAL: Strongswan CRITICAL - ok: 26 not-conn: cp1046_v4, cp1046_v6 [10:31:43] *mediawiki-config [10:32:02] PROBLEM - IPsec on cp3006 is CRITICAL: Strongswan CRITICAL - ok: 26 not-conn: cp1046_v4, cp1046_v6 [10:32:02] PROBLEM - IPsec on cp3003 is CRITICAL: Strongswan CRITICAL - ok: 26 not-conn: cp1046_v4, cp1046_v6 [10:32:02] PROBLEM - IPsec on cp4019 is CRITICAL: Strongswan CRITICAL - ok: 26 not-conn: cp1046_v4, cp1046_v6 [10:32:12] PROBLEM - IPsec on cp3005 is CRITICAL: Strongswan CRITICAL - ok: 26 not-conn: cp1046_v4, cp1046_v6 [10:32:22] PROBLEM - IPsec on cp4011 is CRITICAL: Strongswan CRITICAL - ok: 26 not-conn: cp1046_v4, cp1046_v6 [10:32:22] PROBLEM - IPsec on cp4020 is CRITICAL: Strongswan CRITICAL - ok: 26 not-conn: cp1046_v4, cp1046_v6 [10:32:42] RECOVERY - puppet last run on einsteinium is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures [10:32:42] RECOVERY - IPsec on cp3004 is OK: Strongswan OK - 28 ESP OK [10:33:02] RECOVERY - IPsec on cp3006 is OK: Strongswan OK - 28 ESP OK [10:33:02] RECOVERY - IPsec on cp3003 is OK: Strongswan OK - 28 ESP OK [10:33:02] RECOVERY - IPsec on cp4019 is OK: Strongswan OK - 28 ESP OK [10:33:13] RECOVERY - IPsec on cp3005 is OK: Strongswan OK - 28 ESP OK [10:33:22] RECOVERY - IPsec on cp4011 is OK: Strongswan OK - 28 ESP OK [10:33:22] RECOVERY - IPsec on cp4020 is OK: Strongswan OK - 28 ESP OK [10:37:33] _joe_: Found it [10:37:33] yess [10:37:33] https://www.irccloud.com/pastebin/0RtOp8Of/ [10:37:37] <_joe_> Amir1: the number of requests is down btw [10:38:08] <_joe_> Amir1: are you able to run it? 
I should've been doing other things today tbh [10:38:41] <_joe_> and remember not to post results here [10:40:17] _joe_: sure [10:40:17] I have to be in a meeting in several minutes [10:40:17] Of course [10:40:17] but then I'll do it [10:40:22] PROBLEM - Apache HTTP on mwdebug1002 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50376 bytes in 0.002 second response time [10:40:22] PROBLEM - Nginx local proxy to apache on mwdebug1002 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50376 bytes in 0.010 second response time [10:41:22] RECOVERY - Apache HTTP on mwdebug1002 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.096 second response time [10:41:22] RECOVERY - Nginx local proxy to apache on mwdebug1002 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 619 bytes in 0.112 second response time [10:41:26] <_joe_> Amir1: I have found the "offender", I think [10:48:08] (03CR) 10Jcrespo: [C: 04-1] "This should be packaged on its own .deb, not part of gerrit. And blobs shouldn't be on git." [debs/gerrit] - 10https://gerrit.wikimedia.org/r/336002 (https://phabricator.wikimedia.org/T145885) (owner: 10Paladox) [10:49:38] _joe_: Do you want to ban it? [10:49:42] PROBLEM - puppet last run on mw1169 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [10:49:56] <_joe_> Amir1: yes [10:50:19] <_joe_> Amir1: it's scraping oresscores, he is actually the only user of the API AFAICT [10:50:54] _joe_: yes, we haven't announced this functionality yet! [10:51:10] <_joe_> so yeah, I want to ban it [10:55:01] bah, we are getting overload errors again [10:55:06] https://grafana.wikimedia.org/dashboard/db/ores?panelId=9&fullscreen [10:55:17] _joe_: have you banned it? [10:55:29] Hi. 
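Hashar's nit above about reusing a dpkg-dev makefile for version parsing touches on Debian version ordering, where a `~` component (as in the `0.33~dev` toollabs-webservice build mentioned earlier in the log) sorts before the release it precedes — which is exactly why apt or puppet can see such a build as a downgrade candidate. GNU `sort -V` follows the same tilde rule and makes a quick sanity check; this is an illustrative sketch, not part of the original discussion:

```shell
# Debian-style ordering: "~" sorts before everything, so 0.33~dev < 0.33.
printf '0.33\n0.33~dev\n' | sort -V
```

On GNU coreutils this prints `0.33~dev` before `0.33`, matching how dpkg would order the two versions.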
[10:55:38] <_joe_> Amir1: nope, I'm still trying to determine the best way to do it [10:55:49] mafk: try perhaps through quarry.wmflabs.org, querying the page table [10:56:00] (03CR) 10Muehlenhoff: "Since mariadb/mysql should still be fully compatible on the client side, isn't using the already packaged libmysql-java an option?" [debs/gerrit] - 10https://gerrit.wikimedia.org/r/336002 (https://phabricator.wikimedia.org/T145885) (owner: 10Paladox) [10:56:25] okay, thanks. I hope I could be more useful [10:57:01] (03PS1) 10Elukey: Enable Yarn's Node Manager recovery to allow graceful restarts [puppet/cdh] - 10https://gerrit.wikimedia.org/r/336203 (https://phabricator.wikimedia.org/T156932) [10:57:39] Dereckson: well, that can work too but I've already filed a task to run namespaceDupes since that'll fix the title and will let us modify the page, etc. [10:59:31] jouncebot: next [10:59:32] In 3 hour(s) and 0 minute(s): European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170206T1400) [11:00:02] (03CR) 10Elukey: "From https://hadoop.apache.org/docs/r2.6.0/hadoop-yarn/hadoop-yarn-site/NodeManagerRestart.html it seems that:" [puppet/cdh] - 10https://gerrit.wikimedia.org/r/336203 (https://phabricator.wikimedia.org/T156932) (owner: 10Elukey) [11:06:03] <_joe_> Amir1: actually, I'd say we should ban that IP in ores with a blacklist [11:08:55] Dereckson: https://quarry.wmflabs.org/query/16188 gives no results [11:09:42] RECOVERY - Check HHVM threads for leakage on mw1168 is OK: OK [11:12:02] _joe_: sorry, was in the daily scrum. Can you elaborate more? These requests are coming from mediawiki nodes because someone is hitting the API on Wikipedia and that sends requests to ores. So ores can't block it [11:12:17] <_joe_> why?
[11:12:28] <_joe_> anyways, let me do some more verifications [11:12:53] <_joe_> Amir1: ores can block it alright [11:12:54] because the mediawiki nodes also hit ORES API for damage detection and other useful things too [11:13:13] <_joe_> yes, you can just block that X-Client-IP [11:13:33] so they send the client IP in the header already? [11:13:38] mafk: I'll have a look around 12:00 UTC [11:13:49] <_joe_> yes [11:14:57] nice [11:15:25] Dereckson: sure [11:15:37] <_joe_> Amir1: but, I think this is a red herring [11:15:43] now there is another problem. ORES doesn't have any blocking system. The only way for now is a varnish ban [11:15:53] <_joe_> this client does 100 requests per second [11:16:00] <_joe_> per minute [11:16:00] <_joe_> sorry [11:16:07] <_joe_> I don't think it's the cause of this surge [11:16:47] <_joe_> let me look at the api logs before the surge [11:17:59] <_joe_> yes, those requests were there well before the surge [11:18:24] (03PS2) 10Gehel: WDQS - move metric collection to diamond [puppet] - 10https://gerrit.wikimedia.org/r/335646 (https://phabricator.wikimedia.org/T146468) [11:19:38] (03CR) 10jerkins-bot: [V: 04-1] WDQS - move metric collection to diamond [puppet] - 10https://gerrit.wikimedia.org/r/335646 (https://phabricator.wikimedia.org/T146468) (owner: 10Gehel) [11:19:42] RECOVERY - puppet last run on mw1169 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [11:20:48] marostegui: ping - hola [11:21:43] 06Operations, 10ORES, 06Revision-Scoring-As-A-Service, 10Revision-Scoring-As-A-Service-Backlog, and 2 others: ORES Overloaded (particularly 02/05/17 2:25-2:30) - https://phabricator.wikimedia.org/T157206#3001202 (10Joe) From my further analysis of logs: - there is one API heavy hitter, whose rate of consu...
[11:24:42] PROBLEM - IPsec on cp2009 is CRITICAL: Strongswan CRITICAL - ok: 34 not-conn: cp3006_v4, cp3006_v6 [11:24:42] PROBLEM - IPsec on cp2003 is CRITICAL: Strongswan CRITICAL - ok: 34 not-conn: cp3006_v4, cp3006_v6 [11:24:52] PROBLEM - IPsec on cp1047 is CRITICAL: Strongswan CRITICAL - ok: 22 not-conn: cp3006_v4, cp3006_v6 [11:25:02] PROBLEM - IPsec on cp2021 is CRITICAL: Strongswan CRITICAL - ok: 34 not-conn: cp3006_v4, cp3006_v6 [11:25:25] mafk: hi, we are in a meeting right now, is it important? [11:25:47] marostegui: if you're in a meeting it certainly can wait [11:26:19] mafk: ok, thanks :) [11:26:42] RECOVERY - IPsec on cp2009 is OK: Strongswan OK - 36 ESP OK [11:26:43] RECOVERY - IPsec on cp2003 is OK: Strongswan OK - 36 ESP OK [11:26:52] RECOVERY - IPsec on cp1047 is OK: Strongswan OK - 24 ESP OK [11:27:02] RECOVERY - IPsec on cp2021 is OK: Strongswan OK - 36 ESP OK [11:41:42] PROBLEM - puppet last run on mw1221 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:41:59] 06Operations, 10ORES, 06Revision-Scoring-As-A-Service, 10Revision-Scoring-As-A-Service-Backlog, and 2 others: ORES Overloaded (particularly 02/05/17 2:25-2:30) - https://phabricator.wikimedia.org/T157206#3001307 (10Joe) So, graphing `ores.*.scores_request.*.count` it shows most requests seem to come from `... 
[11:43:03] (03PS1) 10ArielGlenn: dataset1001 rsync to labs of dumps can now use explicit inclusion list [puppet] - 10https://gerrit.wikimedia.org/r/336204 (https://phabricator.wikimedia.org/T154798) [11:46:01] (03PS1) 10ArielGlenn: remove python script for cron rsync of dumps from dataset1001 to labstore [puppet] - 10https://gerrit.wikimedia.org/r/336205 [11:48:02] 06Operations, 10ORES, 06Revision-Scoring-As-A-Service, 10Revision-Scoring-As-A-Service-Backlog, and 2 others: ORES Overloaded (particularly 02/05/17 2:25-2:30) - https://phabricator.wikimedia.org/T157206#3001311 (10Joe) scratch what I said; the counter for etwiki is most likely broken. The surge in reque... [11:50:23] 06Operations: Fix config file handling for /etc/hhvm/php.ini - https://phabricator.wikimedia.org/T157306#3001324 (10MoritzMuehlenhoff) [11:51:14] (03CR) 10Addshore: "I have submitted the following patches which can be merged to turn off the other cron generating these metrics:" [puppet] - 10https://gerrit.wikimedia.org/r/335646 (https://phabricator.wikimedia.org/T146468) (owner: 10Gehel) [11:53:06] gehel ^^ fyi :) [11:54:07] (03PS2) 10ArielGlenn: dumps: More UI cleanup [puppet] - 10https://gerrit.wikimedia.org/r/335684 (https://phabricator.wikimedia.org/T155697) (owner: 10Ladsgroup) [11:54:42] addshore: thanks! I'll merge that once the alerting is also migrated to the new metrics [11:54:57] okay! :) [11:55:31] are you able to merge the old and new metrics? If not it might be worth filing a ticket for someone to do that too! [11:56:25] addshore: dumb question, but why 4 different changes? [11:56:36] addshore: yes, I can try to merge those metrics [11:57:07] 06Operations: Fix config file handling for /etc/hhvm/php.ini - https://phabricator.wikimedia.org/T157306#3001349 (10MoritzMuehlenhoff) [11:57:20] well, there are 2 branches (master & production) production is deployed, master is not. 
I can merge the master ones now [11:57:28] and then split it into removing the cron and removing the file [11:59:05] Oh, I did not see the master / prod. Why split the cron and the file? [11:59:10] * gehel is trying to learn... [12:01:17] I think it keeps things cleaner, only the cron change needs to be merged to have the required effect. removing the file is just cleanup after that [12:01:32] if the cron change needs to be reverted, the revert is smaller etc. [12:01:43] just my weird way ;) [12:02:09] ok, makes some sense. I would probably have done a single change, but why not [12:03:38] !log upgrading mwdebug* and mw1261 to hhvm 3.12.11+dfsg-1+wmf2 [12:03:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:04:18] (03PS1) 10Elukey: Raise the retry interval of the Apache/HHVM leak monitor [puppet] - 10https://gerrit.wikimedia.org/r/336212 [12:04:28] apergos: hey, Thanks! When should we merge it? [12:05:25] Amir1: the usual, I need to test it, if you can screenshot the (small) changes to download-index and (choose random) one of the others and shove them up on the ticket that would be good, I've sent email already [12:05:35] (03CR) 10Joal: "@elukey: I like the idea of trying it first on a single node." [puppet/cdh] - 10https://gerrit.wikimedia.org/r/336203 (https://phabricator.wikimedia.org/T156932) (owner: 10Elukey) [12:05:42] so if we hear nothing it'll probably be late my Thursday again [12:06:29] (03PS3) 10Addshore: Copy InterwikiSorting settings from wmgWikibaseClientSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333936 [12:07:16] apergos: okay, thanks [12:07:23] thank you! [12:08:59] 06Operations, 10ORES, 06Revision-Scoring-As-A-Service, 10Revision-Scoring-As-A-Service-Backlog, and 2 others: ORES Overloaded (particularly 02/05/17 2:25-2:30) - https://phabricator.wikimedia.org/T157206#3001387 (10Joe) Looking into it better, the api user wasn't a red herring after all; I am going to ban...
[12:10:12] PROBLEM - Nginx local proxy to apache on mw1261 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50371 bytes in 0.007 second response time [12:10:42] RECOVERY - puppet last run on mw1221 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures [12:11:12] RECOVERY - Nginx local proxy to apache on mw1261 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 0.079 second response time [12:11:31] (03PS2) 10Elukey: Raise the retry interval of the Apache/HHVM leak monitor [puppet] - 10https://gerrit.wikimedia.org/r/336212 [12:17:01] (03CR) 10Elukey: [C: 032] Raise the retry interval of the Apache/HHVM leak monitor [puppet] - 10https://gerrit.wikimedia.org/r/336212 (owner: 10Elukey) [12:17:17] I want to test --^ [12:17:40] it is horrible I know but something must be done :D [12:18:22] PROBLEM - Start a job and verify on Precise on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/grid/start/precise - 185 bytes in 0.214 second response time [12:19:22] RECOVERY - Start a job and verify on Precise on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.584 second response time [12:20:14] 06Operations, 10ORES, 06Revision-Scoring-As-A-Service, 10Revision-Scoring-As-A-Service-Backlog, and 2 others: ORES Overloaded (particularly 02/05/17 2:25-2:30) - https://phabricator.wikimedia.org/T157206#3001406 (10Ladsgroup) Okay, let's block them for now. Until we find a way to only hold-out the abuser. 
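The block _joe_ and Amir1 settle on above (keying on the X-Client-IP header forwarded by the cache frontends, since ORES itself has no blocking mechanism) could be expressed at the Varnish layer roughly as the following VCL fragment. This is a hypothetical config sketch, not the actual WMF VCL; the IP is a documentation placeholder (TEST-NET-3), not the real client:

```vcl
sub vcl_recv {
    # Hypothetical sketch: refuse one abusive client, identified by the
    # X-Client-IP header set by the frontend, before reaching the backends.
    if (req.http.X-Client-IP == "203.0.113.7") {
        return (synth(403, "Blocked due to excessive request rate"));
    }
}
```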
[12:24:13] 06Operations: Fix config file handling for /etc/hhvm/php.ini - https://phabricator.wikimedia.org/T157306#3001410 (10elukey) [12:25:11] mafk: careful with 'Delete the wrong titled page through API', by default it's still the homepage offered in the sandbox [12:25:33] Dereckson: don't understand [12:25:44] mafk: I wonder if /w/index.php?cur=6196320&action=delete wouldn't work by the way [12:26:05] Dereckson: haven't thought of that, but API deletion did work [12:26:16] I'd have preferred the page renamed and deleted the normal way though [12:26:29] I guess restoring it is complicated given that the title is "invalid" [12:26:30] mafk: there is an "api sandbox" to test API requests, or do them [12:26:39] Dereckson: that's what I used [12:26:53] mafk: by default, when you don't fill any parameter, it offers to delete the main page [12:27:20] mafk: and when you hit previous in your browser, it restores to default values, not to the one you've just filled [12:27:42] Dereckson: deleting the main page is disabled by default I think [12:28:07] ie, I can't go to enwiki and hit delete, it gives me something like "deleting the main page is prohibited" or so [12:28:14] even with bigdelete rights that I have [12:28:16] ok [12:28:30] I did fill pageid on APISandbox of course :) [12:28:30] it's not because of the number of edits by the way?
[12:28:51] not sure, I think it might be blocked somewhere else [12:34:41] (03PS1) 10Tim Landscheidt: k8s: Update path to kube-proxy [puppet] - 10https://gerrit.wikimedia.org/r/336214 (https://phabricator.wikimedia.org/T157243) [12:36:49] <_joe_> jouncebot: next [12:36:49] In 1 hour(s) and 23 minute(s): European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170206T1400) [12:38:24] PROBLEM - Start a job and verify on Precise on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/grid/start/precise - 185 bytes in 0.254 second response time [12:39:24] RECOVERY - Start a job and verify on Precise on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.649 second response time [12:44:04] (03CR) 10Volans: "Will the file be removed manually on the hosts where was deployed? (as opposed to use ensure absent)" [puppet] - 10https://gerrit.wikimedia.org/r/336205 (owner: 10ArielGlenn) [12:49:15] apergos: https://phabricator.wikimedia.org/T155697#3001493 [12:50:39] Amir1: very nice! thank you [12:50:57] You are very welcome [13:01:00] 06Operations, 07Documentation, 07LDAP: Review list of LDAP groups and document exactly what kind of access they can be allowed to provide - https://phabricator.wikimedia.org/T129788#3001512 (10MoritzMuehlenhoff) a:03MoritzMuehlenhoff [13:05:57] 06Operations, 07LDAP: Update/add/remove LDAP entries based on changes to data.yaml - https://phabricator.wikimedia.org/T142819#3001522 (10MoritzMuehlenhoff) I'll extend the data.yaml file to also track users with privileged LDAP access (ops or wmf group), but no production shell access. That way we have the sa... 
[13:07:21] PROBLEM - Start a job and verify on Precise on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/grid/start/precise - 185 bytes in 0.394 second response time [13:08:25] 06Operations: Expire time on 404 is too high (Wikipedia) - https://phabricator.wikimedia.org/T157214#3001533 (10Aklapper) Hi @Mjbmr, thanks for taking the time to report this! Unfortunately this report lacks some information. If you have time and can still reproduce the problem, please [[ https://www.mediawiki.o... [13:09:21] RECOVERY - Start a job and verify on Precise on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.853 second response time [13:10:19] 06Operations, 10Gerrit, 06Release-Engineering-Team: Investigate why cobalt went down for 1 minute on 2017-02-05 and then again 4 minutes later - https://phabricator.wikimedia.org/T157203#3001539 (10Aklapper) [13:19:21] PROBLEM - puppet last run on cp1058 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:19:31] PROBLEM - Start a job and verify on Precise on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/grid/start/precise - 185 bytes in 1.688 second response time [13:20:21] RECOVERY - Start a job and verify on Precise on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.549 second response time [13:21:42] 06Operations, 10DBA, 10Gerrit, 13Patch-For-Review: Gerrit shows HTTP 500 error when pasting extended unicode characters - https://phabricator.wikimedia.org/T145885#3001567 (10Aklapper) @Paladox: Please do not add random projects to tasks. 
[13:23:57] !log removing stale puppet lock file on elastic10(22|26) [13:24:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:24:21] RECOVERY - puppet last run on elastic1022 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures [13:25:21] RECOVERY - puppet last run on elastic1026 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures [13:30:05] !log applied https://gerrit.wikimedia.org/r/#/c/336203/ manually to analytics1028 (hadoop worker node) as live test - T156932 [13:30:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:30:10] T156932: Investigate if Node Managers can be restarted without impacting running containers - https://phabricator.wikimedia.org/T156932 [13:41:06] (03PS3) 10Gehel: WDQS - move metric collection to diamond [puppet] - 10https://gerrit.wikimedia.org/r/335646 (https://phabricator.wikimedia.org/T146468) [13:44:14] !log cache_misc: upgrade to jessie 8.7 and reboot into kernel 4.4.2-3+wmf8 T155401 [13:44:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:44:19] T155401: Integrate jessie 8.7 point release - https://phabricator.wikimedia.org/T155401 [13:44:41] (03PS1) 10Elukey: Silence apt rsync repo activities [puppet] - 10https://gerrit.wikimedia.org/r/336218 (https://phabricator.wikimedia.org/T132324) [13:45:04] 06Operations, 06Discovery, 10Wikidata, 10Wikidata-Query-Service, and 2 others: Install and configure new WDQS nodes on codfw - https://phabricator.wikimedia.org/T144380#3001641 (10Gehel) [13:45:07] 06Operations, 06Discovery, 10Wikidata, 10Wikidata-Query-Service, and 2 others: Move data storage to /srv/wdqs/ on codfw WDQS nodes - https://phabricator.wikimedia.org/T144536#3001639 (10Gehel) 05Resolved>03Open Re-opening, wdqs100[12] still need a reimage to match this new data path. This was on hold,... 
[13:47:21] RECOVERY - puppet last run on cp1058 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures [13:47:37] (03CR) 10Rush: [C: 032] k8s: Update path to kube-proxy [puppet] - 10https://gerrit.wikimedia.org/r/336214 (https://phabricator.wikimedia.org/T157243) (owner: 10Tim Landscheidt) [13:47:44] (03PS1) 10Gehel: wdqs - move data to /srv/wdqs to follow the usual partitioning scheme [puppet] - 10https://gerrit.wikimedia.org/r/336219 (https://phabricator.wikimedia.org/T144536) [13:48:46] (03PS2) 10Rush: labstore: Remove create-dbusers [puppet] - 10https://gerrit.wikimedia.org/r/336157 (owner: 10Tim Landscheidt) [13:49:34] (03PS2) 10Gehel: wdqs - move data to /srv/wdqs to follow the usual partitioning scheme [puppet] - 10https://gerrit.wikimedia.org/r/336219 (https://phabricator.wikimedia.org/T144536) [13:51:13] (03CR) 10Rush: [C: 032] "I have a few questions for yuvi about the impact here currently before this is merged." [puppet] - 10https://gerrit.wikimedia.org/r/336214 (https://phabricator.wikimedia.org/T157243) (owner: 10Tim Landscheidt) [13:51:53] (03CR) 10Rush: [C: 032] labstore: Remove create-dbusers [puppet] - 10https://gerrit.wikimedia.org/r/336157 (owner: 10Tim Landscheidt) [13:56:31] PROBLEM - Start a job and verify on Precise on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/grid/start/precise - 185 bytes in 0.564 second response time [13:57:31] RECOVERY - Start a job and verify on Precise on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 1.114 second response time [13:59:02] ^ madhuvishy when you're awake, did we convert all of these over to a per-test-service I thought? still happening I guess? [14:00:04] addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Dear anthropoid, the time has come. 
Please deploy European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170206T1400). [14:00:04] kart_, addshore, and Amir1: A patch you scheduled for European Mid-day SWAT(Max 8 patches) is about to be deployed. Please be available during the process. [14:00:13] o/ [14:00:19] *waves* I can do swat today (as many of the patches are mine) [14:00:44] o/ [14:01:19] kart_: here? [14:01:31] PROBLEM - puppet last run on cp3007 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:01:53] greg-g: hashar: perhaps we should add zeljkof to https://wikitech.wikimedia.org/wiki/SWAT_deploys#The_team by the way [14:02:14] Amir1: I'll start with yours! [14:02:36] addshore: thanks. This one is rather important since we are still getting overload errors [14:02:41] I saw :) [14:02:59] Dereckson: definitely :} Be bold? :} [14:03:18] Dereckson: I am adding zeljkof [14:05:25] !log fixed duplicate entries in source.list on db2040 and es2002 (trusty) [14:05:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:05:45] Amir1: your change is on mwdebug1002! please check! [14:05:55] on it [14:06:20] 06Operations, 10Deployment-Systems, 06Release-Engineering-Team, 10scap, 15User-Addshore: cannot delete non-empty directory: php-1.29.0-wmf.3 messages on 'scap sync' on mwdebug1002 - https://phabricator.wikimedia.org/T157030#3001755 (10Addshore) Looks like these messages still appear today during EU swat! 
[14:08:17] addshore: it's okay in every aspect [14:08:25] syncing [14:08:56] (03PS4) 10Addshore: Enable TwoColConflict on mw.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/332909 (https://phabricator.wikimedia.org/T155717) [14:09:13] !log addshore@tin Synchronized php-1.29.0-wmf.10/extensions/ORES/extension.json: T157206 [[gerrit:336215|ORES - Remove all (except meta) API funcationality hooks]] (duration: 00m 51s) [14:09:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:09:17] T157206: ORES Overloaded (particularly 02/05/17 2:25-2:30) - https://phabricator.wikimedia.org/T157206 [14:09:22] (03CR) 10Addshore: [C: 032] Enable TwoColConflict on mw.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/332909 (https://phabricator.wikimedia.org/T155717) (owner: 10Addshore) [14:09:29] Amir1: should be everywhere! please check! :) [14:09:42] still no kart_ so I'll push on with my patches [14:10:29] addshore: It's okay [14:11:07] (03Merged) 10jenkins-bot: Enable TwoColConflict on mw.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/332909 (https://phabricator.wikimedia.org/T155717) (owner: 10Addshore) [14:11:15] (03CR) 10jenkins-bot: Enable TwoColConflict on mw.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/332909 (https://phabricator.wikimedia.org/T155717) (owner: 10Addshore) [14:11:21] Dereckson: thanks! [14:11:47] (03PS1) 10Gehel: WDQS - move metric collection to diamond [puppet] - 10https://gerrit.wikimedia.org/r/336220 (https://phabricator.wikimedia.org/T146468) [14:12:11] PROBLEM - Check systemd state on cp1051 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [14:14:51] PROBLEM - puppet last run on db1031 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:15:11] PROBLEM - Check systemd state on db1074 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[14:15:25] (03PS4) 10Addshore: Copy InterwikiSorting settings from wmgWikibaseClientSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333936 [14:15:28] (03PS3) 10Addshore: Rm InterwikiSorting settings from wmgWikibaseClientSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333885 (https://phabricator.wikimedia.org/T155995) [14:15:31] !log addshore@tin Synchronized wmf-config/InitialiseSettings.php: T155717 [[gerrit:332909|Enable TwoColConflict on mw.org]] (duration: 00m 40s) [14:15:32] (03PS4) 10Addshore: Enable InterwikiSorting on beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333603 (https://phabricator.wikimedia.org/T155995) [14:15:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:15:35] T155717: Deploy TwoColConflict extension to mediawiki.org - https://phabricator.wikimedia.org/T155717 [14:17:01] zeljkof: you're welcome [14:17:48] (03CR) 10Addshore: [C: 032] Copy InterwikiSorting settings from wmgWikibaseClientSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333936 (owner: 10Addshore) [14:19:01] !log Stop MySQL and shutdown db2060 for maintenance - T156161 [14:19:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:19:05] T156161: db2060 not accessible - https://phabricator.wikimedia.org/T156161 [14:19:18] zeljkof: around. [14:19:24] (03Merged) 10jenkins-bot: Copy InterwikiSorting settings from wmgWikibaseClientSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333936 (owner: 10Addshore) [14:19:31] (03CR) 10jenkins-bot: Copy InterwikiSorting settings from wmgWikibaseClientSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333936 (owner: 10Addshore) [14:19:34] Got lagged by something urgent at home :/ [14:19:45] hi kart_! I can do yours after this patch! [14:19:52] addshore: thanks. [14:22:02] kart_: addshore is in charge! :) [14:22:30] zeljkof: okay :) just realized. 
[14:22:45] Wondering why I didn't get notification of ping. [14:22:52] 06Operations, 10ops-codfw, 10DBA, 13Patch-For-Review: db2060 not accessible - https://phabricator.wikimedia.org/T156161#3001821 (10Marostegui) @Papaul db2060 is now off so you can proceed whenever you want. Thanks! [14:23:01] addshore: it looks like the branch deployed for the wmde stats is master and not production (don't ask me why) [14:23:08] !log addshore@tin Synchronized wmf-config/InitialiseSettings.php: [[gerrit:333936|Copy InterwikiSorting settings from wmgWikibaseClientSettings]] (duration: 00m 40s) [14:23:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:23:19] addshore: we seem to have lost metrics on wdqs lag since the merge... [14:23:37] gehel: really? O_o that's odd... [14:23:45] addshore: I mean the merge of https://gerrit.wikimedia.org/r/#/c/336206/ and https://gerrit.wikimedia.org/r/#/c/335646/ [14:23:49] and slightly worrying... [14:24:00] production should be specified in puppet [14:24:18] addshore: I'm checking right now... [14:24:32] kart_: I'm doing yours now then! :) [14:24:39] (03PS2) 10Addshore: Deploy Compact Language Links out of beta in French/Dutch Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/335781 (https://phabricator.wikimedia.org/T157108) (owner: 10KartikMistry) [14:24:45] (03CR) 10Addshore: [C: 032] Deploy Compact Language Links out of beta in French/Dutch Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/335781 (https://phabricator.wikimedia.org/T157108) (owner: 10KartikMistry) [14:24:46] addshore: cool. I got notification this time. 
[14:26:31] RECOVERY - puppet last run on cp3007 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [14:26:32] (03Merged) 10jenkins-bot: Deploy Compact Language Links out of beta in French/Dutch Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/335781 (https://phabricator.wikimedia.org/T157108) (owner: 10KartikMistry) [14:26:55] kart_: the change is live on mwdebug1002, please check :) [14:27:00] (03CR) 10jenkins-bot: Deploy Compact Language Links out of beta in French/Dutch Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/335781 (https://phabricator.wikimedia.org/T157108) (owner: 10KartikMistry) [14:27:08] Checking [14:27:21] PROBLEM - IPsec on cp4003 is CRITICAL: Strongswan CRITICAL - ok: 24 not-conn: cp1061_v4, cp1061_v6, cp2018_v4, cp2018_v6 [14:27:21] PROBLEM - IPsec on cp4002 is CRITICAL: Strongswan CRITICAL - ok: 24 not-conn: cp1061_v4, cp1061_v6, cp2018_v4, cp2018_v6 [14:27:21] PROBLEM - IPsec on cp4001 is CRITICAL: Strongswan CRITICAL - ok: 24 not-conn: cp1061_v4, cp1061_v6, cp2018_v4, cp2018_v6 [14:27:21] PROBLEM - IPsec on cp4004 is CRITICAL: Strongswan CRITICAL - ok: 24 not-conn: cp1061_v4, cp1061_v6, cp2018_v4, cp2018_v6 [14:27:21] PROBLEM - IPsec on cp3007 is CRITICAL: Strongswan CRITICAL - ok: 24 not-conn: cp1061_v4, cp1061_v6, cp2018_v4, cp2018_v6 [14:27:31] PROBLEM - IPsec on cp3010 is CRITICAL: Strongswan CRITICAL - ok: 24 not-conn: cp1061_v4, cp1061_v6, cp2018_v4, cp2018_v6 [14:28:11] PROBLEM - IPsec on cp3008 is CRITICAL: Strongswan CRITICAL - ok: 24 not-conn: cp1061_v4, cp1061_v6, cp2018_v4, cp2018_v6 [14:28:29] (03PS4) 10Addshore: Rm InterwikiSorting settings from wmgWikibaseClientSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333885 (https://phabricator.wikimedia.org/T155995) [14:28:32] (03PS5) 10Addshore: Enable InterwikiSorting on beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333603 (https://phabricator.wikimedia.org/T155995) [14:28:47] (03CR) 10Ema: [C: 
031] "I've tested this on a labs instance and it works great. Just one minor question." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/333231 (https://phabricator.wikimedia.org/T155807) (owner: 10Faidon Liambotis) [14:28:50] addshore: looks good on fr, checking on nlwiki; [14:28:51] PROBLEM - puppet last run on mw1195 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:28:53] ack [14:29:01] PROBLEM - Host cp2018 is DOWN: PING CRITICAL - Packet loss = 100% [14:29:22] looking into cp2018 [14:29:36] addshore: nl too. so, go ahead. [14:29:48] syncing [14:30:02] addshore: the branch is specified in puppet, but it looks like git::clone only sets the branch on initial clone and just does a pull on subsequent updates [14:30:24] !log addshore@tin Synchronized dblists/compact-language-links.dblist: T157108 & T157112 [[gerrit:335781|Deploy Compact Language Links out of beta in French/Dutch Wikipedia]] (duration: 00m 40s) [14:30:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:30:29] T157112: Deploy Compact Language Links in Dutch Wikipedia as a non-Beta feature - https://phabricator.wikimedia.org/T157112 [14:30:29] T157108: Deploy Compact Language Links in French Wikipedia a non-Beta feature - https://phabricator.wikimedia.org/T157108 [14:30:38] kart_: ^^ all done [14:30:43] addshore: thanks [14:31:01] PROBLEM - puppet last run on analytics1034 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:31:23] (03CR) 10Addshore: [C: 032] Rm InterwikiSorting settings from wmgWikibaseClientSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333885 (https://phabricator.wikimedia.org/T155995) (owner: 10Addshore) [14:31:29] gehel: bah! that's annoying... [14:31:57] addshore: it looks like the only difference between master and production is those last 2 commits. I can just reset the branch to production...
[14:32:20] gehel: yup, that should be good [14:32:22] !log cp2018 stuck rebooting, powercycled [14:32:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:32:29] ok, will do [14:32:55] (03CR) 10Faidon Liambotis: Setup & configure certspotter (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/333231 (https://phabricator.wikimedia.org/T155807) (owner: 10Faidon Liambotis) [14:33:13] (03Merged) 10jenkins-bot: Rm InterwikiSorting settings from wmgWikibaseClientSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333885 (https://phabricator.wikimedia.org/T155995) (owner: 10Addshore) [14:33:23] (03CR) 10jenkins-bot: Rm InterwikiSorting settings from wmgWikibaseClientSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333885 (https://phabricator.wikimedia.org/T155995) (owner: 10Addshore) [14:33:41] RECOVERY - Host cp2018 is UP: PING WARNING - Packet loss = 73%, RTA = 36.12 ms [14:33:43] (03PS1) 10Muehlenhoff: First batch of entries for LDAP users [puppet] - 10https://gerrit.wikimedia.org/r/336222 [14:33:52] !log resetting analytics-wmde/scripts on stat1002 to the correct "production" branch [14:33:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:34:21] RECOVERY - IPsec on cp4003 is OK: Strongswan OK - 28 ESP OK [14:34:21] RECOVERY - IPsec on cp4001 is OK: Strongswan OK - 28 ESP OK [14:34:21] RECOVERY - IPsec on cp4004 is OK: Strongswan OK - 28 ESP OK [14:34:31] RECOVERY - IPsec on cp3007 is OK: Strongswan OK - 28 ESP OK [14:34:31] RECOVERY - IPsec on cp3008 is OK: Strongswan OK - 28 ESP OK [14:34:32] RECOVERY - IPsec on cp3010 is OK: Strongswan OK - 28 ESP OK [14:35:01] PROBLEM - Check systemd state on cp1061 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
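The git::clone behaviour gehel describes above — the branch is only honoured on the initial clone, while later runs just `git pull` whatever branch is checked out — can be reproduced with plain git. This is a minimal sketch in throwaway temp repos (all paths are hypothetical scratch directories, not the real stat1002 checkout):

```shell
set -e
work=$(mktemp -d)
cd "$work"

# Upstream repo whose default (HEAD) branch is not the one we want deployed.
git init -q upstream
git -C upstream -c user.email=t@t -c user.name=t commit -q --allow-empty -m init
git -C upstream branch production

# The initial clone checks out upstream's default branch;
# subsequent pulls stay on it and never switch to "production".
git clone -q upstream deployed
cd deployed
git pull -q
echo "before: $(git rev-parse --abbrev-ref HEAD)"

# The manual fix applied above: explicitly switch to the intended branch.
git checkout -q production
echo "after: $(git rev-parse --abbrev-ref HEAD)"
```

The second `echo` prints `after: production`; the first prints whatever upstream's default branch happens to be, which is exactly why the deployed checkout silently tracked master.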
[14:37:21] RECOVERY - IPsec on cp4002 is OK: Strongswan OK - 28 ESP OK [14:39:19] <_joe_> Amir1: your patch didn't stop the requests apparently [14:39:42] <_joe_> Amir1: still getting a ton of multi-revision requests [14:39:58] <_joe_> ok gonna look in varnish [14:40:11] !log addshore@tin Synchronized wmf-config/InitialiseSettings.php: T155995 [[gerrit:333885|Rm InterwikiSorting settings from wmgWikibaseClientSettings]] PT 1/2 (duration: 00m 41s) [14:40:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:40:15] T155995: Deploy InterwikiSorting extension to beta - https://phabricator.wikimedia.org/T155995 [14:41:31] PROBLEM - puppet last run on oresrdb1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:41:59] !log addshore@tin Synchronized wmf-config/Wikibase.php: T155995 [[gerrit:333885|Rm InterwikiSorting settings from wmgWikibaseClientSettings]] PT 2/2 (duration: 00m 39s) [14:42:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:42:48] (03CR) 10Addshore: [C: 032] Enable InterwikiSorting on beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333603 (https://phabricator.wikimedia.org/T155995) (owner: 10Addshore) [14:43:51] RECOVERY - puppet last run on db1031 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures [14:43:56] aaa [14:44:49] (03Merged) 10jenkins-bot: Enable InterwikiSorting on beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333603 (https://phabricator.wikimedia.org/T155995) (owner: 10Addshore) [14:45:48] !log addshore@tin Synchronized wmf-config/InitialiseSettings-labs.php: PROD-NOOP [[gerrit:333603|Enable InterwikiSorting on beta]] (duration: 00m 39s) [14:45:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:46:10] !log EU SWAT all done! 
[14:46:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:46:45] (03CR) 10jenkins-bot: Enable InterwikiSorting on beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333603 (https://phabricator.wikimedia.org/T155995) (owner: 10Addshore) [14:46:50] <_joe_> addshore: apparently Amir1's patch didn't have the expected effect. the mw api is still making requests to ores [14:48:22] *looks* [14:49:11] PROBLEM - puppet last run on db1068 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:50:11] Amir1: _joe_ var_dump($wgHooks['APIGetAllowedParams']); still shows "ORES\ApiHooks::onAPIGetAllowedParams" for enwiki [14:50:33] looking at the patch it shouldn't, I'm guessing this is something to do with some caching in extension registration? [14:50:57] <_joe_> probably when releasing that patch some preparation was needed, yes [14:51:45] !log upgrading mw1262-mw1265 to hhvm 3.12.11+dfsg-1+wmf2 [14:51:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:52:21] addshore: _joe_ https://en.wikipedia.org/w/api.php?action=help&modules=query [14:52:32] it doesn't have ores in it anymore [14:53:07] <_joe_> Amir1: still, I do see calls coming from the api to ores [14:53:31] <_joe_> I tend to think that (and what addshore just said) is a good proof that somehow that's still true [14:53:40] and on tin the hooks seem to still be registered [14:53:43] <_joe_> addshore: where did you verify that? [14:53:51] <_joe_> addshore: tin? ok [14:54:21] <_joe_> addshore: I think we found the culprit [14:54:31] <_joe_> well, Amir did [14:55:00] (03CR) 10Ottomata: [C: 04-2] "This is already done. See the last line of this block. We can abandon this change." 
[puppet] - 10https://gerrit.wikimedia.org/r/335854 (https://phabricator.wikimedia.org/T153207) (owner: 10Nuria) [14:55:29] Amir1: ores still appears when looking at https://en.wikipedia.org/w/api.php?action=help&modules=query under the meta param! [14:55:51] PROBLEM - puppet last run on mw1262 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[hhvm-dbg] [14:55:55] addshore: the meta param is okay. I kept it (as mentioned in the commit message) [14:56:38] addshore: but it appears in https://en.wikipedia.org/w/api.php?action=help&modules=query%2Brecentchanges which it shouldn't. Even funnier: it doesn't in beta: https://en.wikipedia.beta.wmflabs.org/w/api.php?action=help&modules=query%2Brecentchanges [14:56:49] I'm lost for words at this situation [14:57:39] just went back and checked and the sync looked normal [14:57:51] RECOVERY - puppet last run on mw1195 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [14:59:54] <_joe_> Amir1: how is that tied to recentchanges? [15:00:01] RECOVERY - puppet last run on analytics1034 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures [15:00:33] (03PS4) 10Gehel: WDQS - move metric collection to diamond [puppet] - 10https://gerrit.wikimedia.org/r/335646 (https://phabricator.wikimedia.org/T146468) [15:02:24] _joe_: hooks [15:02:31] which I removed [15:02:46] <_joe_> I was under that impression too :P [15:03:33] Amir1: were they gone when the change was pulled to mwdebug1002? [15:04:17] addshore: I don't know, mistakenly I checked the query api page [15:04:22] not recent change api page [15:05:22] Bah, wait, I have it...
[15:05:37] (03PS1) 10Volans: /home: update my own .bashrc [puppet] - 10https://gerrit.wikimedia.org/r/336223 [15:05:47] I missed the submodule update for the change, so the sync was just the old file again [15:05:54] <_joe_> oh [15:05:56] <_joe_> :) [15:06:00] yeh [15:06:06] https://www.irccloud.com/pastebin/E6GuTpoq/ [15:06:43] okay, let's try this one again then... [15:07:08] (03PS2) 10Volans: /home: update my own .bashrc [puppet] - 10https://gerrit.wikimedia.org/r/336223 [15:07:58] _joe_: Amir1 on mwdebug1002 now [15:08:09] addshore: That wasted half an hour of mine in the last deployment, so I added https://wikitech.wikimedia.org/w/index.php?title=SWAT_deploys/Deployers&diff=1359915&oldid=1054989 [15:08:10] looks good to me [15:08:41] (03CR) 10Volans: [C: 032] /home: update my own .bashrc [puppet] - 10https://gerrit.wikimedia.org/r/336223 (owner: 10Volans) [15:08:43] Amir1: I have my own checklist @ https://wikitech.wikimedia.org/wiki/User:Addshore/Deployments removing all the cruft from the list, but still I missed it... [15:09:04] <_joe_> addshore: seems ok, thanks :) [15:09:08] addshore: LGTM [15:09:22] okay, resyncing... [15:09:31] RECOVERY - puppet last run on oresrdb1002 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures [15:10:08] (03PS3) 10Gehel: wdqs1001 - move data to /srv/wdqs to follow the usual partitioning scheme [puppet] - 10https://gerrit.wikimedia.org/r/336219 (https://phabricator.wikimedia.org/T144536) [15:10:12] Nice! [15:10:18] !log addshore@tin Synchronized php-1.29.0-wmf.10/extensions/ORES/extension.json: T157206 [[gerrit:336215|ORES - Remove all (except meta) API functionality hooks]] (take2) (duration: 00m 54s) [15:10:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:10:24] T157206: ORES Overloaded (particularly 02/05/17 2:25-2:30) - https://phabricator.wikimedia.org/T157206 [15:10:25] Amir1: _joe_ ^^ [15:10:48] sorry for the wasted time there!
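For anyone hitting the same failure mode — the file synced fine but the change never landed because `git submodule update` was skipped — the stale state is visible in `git submodule status` before syncing. A self-contained sketch with throwaway repos (the names `sub`/`super` are illustrative stand-ins, not the real deployment tree):

```shell
set -e
work=$(mktemp -d)
cd "$work"

# Stand-in for an extension repo (e.g. extensions/ORES).
git init -q sub
git -C sub -c user.email=t@t -c user.name=t commit -q --allow-empty -m one

# Stand-in for the deployed branch that embeds it as a submodule.
git init -q super
cd super
git -c protocol.file.allow=always submodule add -q "$work/sub" sub
git -c user.email=t@t -c user.name=t commit -q -m 'add submodule'

# The extension gains a new commit and the branch bumps the recorded
# pointer, but nobody runs `git submodule update`: the checkout is stale.
git -C "$work/sub" -c user.email=t@t -c user.name=t commit -q --allow-empty -m two
git update-index --cacheinfo "160000,$(git -C "$work/sub" rev-parse HEAD),sub"
git -c user.email=t@t -c user.name=t commit -q -m 'bump submodule pointer'

git submodule status sub   # leading '+' marks the stale checkout

# The missed step: sync the work tree to the recorded commit.
git -c protocol.file.allow=always submodule update -q --init sub
git submodule status sub   # leading space: checkout matches the pointer again
```

Running `git submodule status` (and fixing any `+`-prefixed entries) before `scap sync-file` is a cheap guard worth adding to a deploy checklist like the ones linked above.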
[15:11:30] Thanks addshore [15:11:46] (03PS1) 10Muehlenhoff: Update to 4.4.46 [debs/linux44] - 10https://gerrit.wikimedia.org/r/336224 [15:13:31] !log cache_text: upgrade to jessie 8.7 and reboot into kernel 4.4.2-3+wmf8 T155401 !log cache_maps: upgrade to jessie 8.7 and reboot into kernel 4.4.2-3+wmf8 T155401 [15:13:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:13:35] T155401: Integrate jessie 8.7 point release - https://phabricator.wikimedia.org/T155401 [15:14:02] https://grafana.wikimedia.org/dashboard/db/ores?panelId=1&fullscreen&from=now-1h&to=now [15:14:05] weeeeeeeeeeeeee [15:15:24] <_joe_> Amir1: ores continues to work well for the wikis? [15:15:48] (03PS1) 10Hoo man: Search index article placeholders up to Q2794 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/336225 (https://phabricator.wikimedia.org/T144592) [15:16:05] I tested, yes, I actually reverted some vandalism using ORES [15:16:07] !log starting reimage of wdqs1001 - T144536 [15:16:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:16:11] T144536: Move data storage to /srv/wdqs/ on codfw WDQS nodes - https://phabricator.wikimedia.org/T144536 [15:17:12] RECOVERY - puppet last run on db1068 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [15:17:42] !log gehel@puppetmaster1001 conftool action : set/pooled=no; selector: name=wdqs1001.eqiad.wmnet [15:17:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:17:51] RECOVERY - puppet last run on mw1262 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures [15:18:14] (03PS4) 10Gehel: wdqs1001 - move data to /srv/wdqs to follow the usual partitioning scheme [puppet] - 10https://gerrit.wikimedia.org/r/336219 (https://phabricator.wikimedia.org/T144536) [15:20:01] (03CR) 10Gehel: [C: 032] wdqs1001 - move data to /srv/wdqs to follow the usual partitioning scheme [puppet] -
10https://gerrit.wikimedia.org/r/336219 (https://phabricator.wikimedia.org/T144536) (owner: 10Gehel) [15:23:24] (03PS1) 10Gehel: wdqs1002 - move data to /srv/wdqs to follow the usual partitioning scheme [puppet] - 10https://gerrit.wikimedia.org/r/336226 (https://phabricator.wikimedia.org/T144536) [15:25:02] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "I like the patch, but I'd use "config-etcd" as 'system' in the logging." [debs/pybal] (1.13) - 10https://gerrit.wikimedia.org/r/335844 (https://phabricator.wikimedia.org/T134893) (owner: 10Ema) [15:25:59] 06Operations, 06Discovery, 10Wikidata, 10Wikidata-Query-Service, and 2 others: Move data storage to /srv/wdqs/ on codfw WDQS nodes - https://phabricator.wikimedia.org/T144536#2602875 (10ops-monitoring-bot) Script wmf_auto_reimage was launched by gehel on neodymium.eqiad.wmnet for hosts: ``` ['wdqs1001.eqia... [15:30:28] (03PS2) 10Hoo man: Search index article placeholders on cywiki up to Q2794 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/336225 (https://phabricator.wikimedia.org/T144592) [15:32:23] (03PS2) 10Ema: Log etcd connection status [debs/pybal] (1.13) - 10https://gerrit.wikimedia.org/r/335844 (https://phabricator.wikimedia.org/T134893) [15:34:16] (03PS3) 10Ema: Log etcd connection status [debs/pybal] (1.13) - 10https://gerrit.wikimedia.org/r/335844 (https://phabricator.wikimedia.org/T134893) [15:35:24] (03CR) 10Elukey: "After almost two hours of testing on an1028:" [puppet/cdh] - 10https://gerrit.wikimedia.org/r/336203 (https://phabricator.wikimedia.org/T156932) (owner: 10Elukey) [15:38:44] !log installing lcms security updates on mediawiki canaries [15:38:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:39:00] !log oblivian@tin Started deploy [changeprop/deploy@5f932a3]: Revert ORES throttling [15:39:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:42:49] !log oblivian@tin Finished deploy [changeprop/deploy@5f932a3]: Revert ORES 
throttling (duration: 03m 49s) [15:42:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:43:05] (03PS3) 10Eevans: Enable JMX exporter on RESTBase Staging nodes in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/335826 (https://phabricator.wikimedia.org/T155120) [15:46:16] (03PS2) 10Rush: nodepool: disambiguate images/snapshot/labels [puppet] - 10https://gerrit.wikimedia.org/r/335809 (owner: 10Hashar) [15:46:29] !log Stopping Nodepool for maintenance [15:46:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:46:51] (03CR) 10Rush: [V: 032 C: 032] nodepool: disambiguate images/snapshot/labels [puppet] - 10https://gerrit.wikimedia.org/r/335809 (owner: 10Hashar) [15:48:31] and doing sudo /usr/local/sbin/puppet-run [15:51:38] 06Operations, 06Discovery, 10Wikidata, 10Wikidata-Query-Service, and 2 others: Move data storage to /srv/wdqs/ on codfw WDQS nodes - https://phabricator.wikimedia.org/T144536#3002150 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['wdqs1001.eqiad.wmnet'] ``` and were **ALL** successful. [15:52:11] PROBLEM - Check systemd state on labnodepool1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[15:52:21] PROBLEM - nodepoold running on labnodepool1001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 113 (nodepool), regex args ^/usr/bin/python /usr/bin/nodepoold -d [15:53:36] (03CR) 10Giuseppe Lavagetto: [C: 032] Disable auto-restart for nutcracker when config.yaml changes [puppet] - 10https://gerrit.wikimedia.org/r/335780 (https://phabricator.wikimedia.org/T155755) (owner: 10Elukey) [15:53:59] (03PS2) 10Elukey: Disable auto-restart for nutcracker when config.yaml changes [puppet] - 10https://gerrit.wikimedia.org/r/335780 (https://phabricator.wikimedia.org/T155755) [15:54:31] ^nodepool is known hashar is doing an invasive maint [15:55:01] PROBLEM - IPsec on mc1011 is CRITICAL: Strongswan CRITICAL - ok: 0 not-conn: mc2029_v4 [15:55:28] ouch this one is me [15:55:34] mc2029 is down for maintenance [15:55:49] !log mc2029 shutdown for DC ops [15:55:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:57:39] (03CR) 10Volans: "Some comments inline" (039 comments) [puppet] - 10https://gerrit.wikimedia.org/r/335855 (https://phabricator.wikimedia.org/T126623) (owner: 10Madhuvishy) [15:59:01] RECOVERY - IPsec on mc1011 is OK: Strongswan OK - 1 ESP OK [15:59:11] RECOVERY - Check systemd state on labnodepool1001 is OK: OK - running: The system is fully operational [15:59:21] RECOVERY - nodepoold running on labnodepool1001 is OK: PROCS OK: 1 process with UID = 113 (nodepool), regex args ^/usr/bin/python /usr/bin/nodepoold -d [16:01:28] (03PS1) 10Giuseppe Lavagetto: stdlib: upgrade to 4.15.0 [puppet] - 10https://gerrit.wikimedia.org/r/336230 [16:02:31] PROBLEM - Check systemd state on cp3042 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[16:02:40] (03PS9) 10Rush: nodepool: track and alert on age of instance states [puppet] - 10https://gerrit.wikimedia.org/r/335373 [16:06:27] <_joe_> gehel: ^^ as promised :P [16:06:41] (03CR) 10Ottomata: [C: 031] Enable Yarn's Node Manager recovery to allow graceful restarts [puppet/cdh] - 10https://gerrit.wikimedia.org/r/336203 (https://phabricator.wikimedia.org/T156932) (owner: 10Elukey) [16:07:00] _joe_: kool! [16:07:45] (03CR) 10Giuseppe Lavagetto: [V: 032 C: 032] Initial commit of etcd-mirror [software/etcd-mirror] - 10https://gerrit.wikimedia.org/r/335460 (owner: 10Giuseppe Lavagetto) [16:13:08] for your puppet patches, I guess you can force merge them without waiting for ci [16:15:07] 06Operations, 10Beta-Cluster-Infrastructure, 10MediaWiki-extensions-InterwikiSorting, 10Wikidata, and 4 others: Deploy InterwikiSorting extension to beta - https://phabricator.wikimedia.org/T155995#3002296 (10Addshore) [16:16:01] 06Operations, 10MediaWiki-extensions-InterwikiSorting, 10Wikidata, 10Wikimedia-Extension-setup, and 2 others: Deploy InterwikiSorting extension to production - https://phabricator.wikimedia.org/T150183#3002302 (10Addshore) [16:16:07] 06Operations, 10Beta-Cluster-Infrastructure, 10MediaWiki-extensions-InterwikiSorting, 10Wikidata, and 4 others: Deploy InterwikiSorting extension to beta - https://phabricator.wikimedia.org/T155995#2960964 (10Addshore) 05Open>03Resolved The InteriwkiSorting extension is now deployed on beta sites! Thi... [16:20:35] (03PS5) 10Gehel: WDQS - move metric collection to diamond [puppet] - 10https://gerrit.wikimedia.org/r/335646 (https://phabricator.wikimedia.org/T146468) [16:22:45] 06Operations, 06Discovery, 10Wikidata, 10Wikidata-Query-Service, and 2 others: Move data storage to /srv/wdqs/ on codfw WDQS nodes - https://phabricator.wikimedia.org/T144536#3002304 (10ops-monitoring-bot) Script wmf_auto_reimage was launched by gehel on neodymium.eqiad.wmnet for hosts: ``` ['wdqs1001.eqia... 
[16:24:21] (03PS2) 10Yuvipanda: k8s: Update path to kube-proxy [puppet] - 10https://gerrit.wikimedia.org/r/336214 (https://phabricator.wikimedia.org/T157243) (owner: 10Tim Landscheidt) [16:24:25] (03CR) 10Yuvipanda: [V: 032] k8s: Update path to kube-proxy [puppet] - 10https://gerrit.wikimedia.org/r/336214 (https://phabricator.wikimedia.org/T157243) (owner: 10Tim Landscheidt) [16:25:49] (03CR) 10Yuvipanda: "Yup, this is right. We missed moving the upstart file I guess when moving to debs. Sorry" [puppet] - 10https://gerrit.wikimedia.org/r/336214 (https://phabricator.wikimedia.org/T157243) (owner: 10Tim Landscheidt) [16:32:53] (03CR) 10Giuseppe Lavagetto: [C: 04-1] Puppet compiler: sync newest facts only (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/335670 (https://phabricator.wikimedia.org/T157052) (owner: 10Volans) [16:33:44] 06Operations, 10ops-eqiad, 10media-storage: Degraded RAID on ms-be1013 - https://phabricator.wikimedia.org/T155907#3002342 (10Cmjohnson) @fgiunchedi The disk has been replaced. I am still unable to access production to add it back for you. You may need to do that yourself. [16:33:55] (03CR) 10Giuseppe Lavagetto: [C: 031] Add missing dummy secrets from production [labs/private] - 10https://gerrit.wikimedia.org/r/335643 (owner: 10Volans) [16:35:31] PROBLEM - Unmerged changes on repository puppet on puppetmaster1001 is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). [16:35:38] !log Nodepool Jessie images are back up. Trusty one is being rebuilt.. [16:35:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:37:24] 06Operations, 10ops-eqiad, 13Patch-For-Review: Suspected faulty SSD on graphite1001 - https://phabricator.wikimedia.org/T157022#3002347 (10Cmjohnson) @fgiunchedi I have the ssds on-site. The disk is in a 3.5" internal disk bay, so the server will need to be powered off for the replacement.
[16:37:31] RECOVERY - Unmerged changes on repository puppet on puppetmaster1001 is OK: No changes to merge. [16:38:59] (03CR) 10Aklapper: [C: 04-1] "> This just ups how long a script can run." [puppet] - 10https://gerrit.wikimedia.org/r/335714 (https://phabricator.wikimedia.org/T125357) (owner: 10Paladox) [16:42:10] 06Operations, 10ORES, 06Revision-Scoring-As-A-Service, 10Revision-Scoring-As-A-Service-Backlog, and 4 others: ORES Overloaded (particularly 02/05/17 2:25-2:30) - https://phabricator.wikimedia.org/T157206#3002362 (10Joe) 05Open>03Resolved [16:42:40] 06Operations, 15User-Elukey: prometheus-vhtcpd-stats cronspamming if vhtcpd is not running yet - https://phabricator.wikimedia.org/T157353#3002364 (10ema) [16:43:25] (03CR) 10jerkins-bot: [V: 04-1] stdlib: upgrade to 4.15.0 [puppet] - 10https://gerrit.wikimedia.org/r/336230 (owner: 10Giuseppe Lavagetto) [16:44:05] <_joe_> jerkins-bot [16:45:16] 06Operations, 10ops-eqiad, 13Patch-For-Review: Suspected faulty SSD on graphite1001 - https://phabricator.wikimedia.org/T157022#2992982 (10MoritzMuehlenhoff) >>! In T157022#2997068, @fgiunchedi wrote: > Read traffic has been switched over to graphite2001 now and seems to work. > > Note that graphite2001 was... [16:45:29] 06Operations, 10Traffic, 15User-Elukey: prometheus-vhtcpd-stats cronspamming if vhtcpd is not running yet - https://phabricator.wikimedia.org/T157353#3002382 (10ema) [16:50:20] I am still restoring the Trusty image for Nodepool [16:50:36] 06Operations, 06Analytics-Kanban, 15User-Elukey: Yarn node manager JVM memory leaks - https://phabricator.wikimedia.org/T153951#3002407 (10Milimetric) 05Open>03Resolved This will be resolved by upgrading CDH versions, which will happen soon. So... 
resolved in the future :) [16:51:21] !log Stop MySQL and shutdown db1072 for raid [16:51:24] gah [16:51:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:51:40] !log Stop MySQL and shutdown db1072 for raid and BBU replacement - T156226 [16:51:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:51:43] T156226: Reimage and clone db1072 - https://phabricator.wikimedia.org/T156226 [16:53:50] (03CR) 10Muehlenhoff: [C: 032] Update to 4.4.46 [debs/linux44] - 10https://gerrit.wikimedia.org/r/336224 (owner: 10Muehlenhoff) [16:53:51] PROBLEM - puppet last run on ms-be1004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:55:21] PROBLEM - Check systemd state on cp3043 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [16:58:48] (03PS1) 10Muehlenhoff: Update to 4.4.47 [debs/linux44] - 10https://gerrit.wikimedia.org/r/336235 [17:00:04] godog, moritzm, and _joe_: Dear anthropoid, the time has come. Please deploy Puppet SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170206T1700). [17:00:55] why is there a puppet swat scheduled for Monday as well? copy&paste error seems [17:02:15] moritzm: must be, feel free to remove (I'm going into a meeting now) [17:02:42] same here, will remove it later on [17:03:02] !log Nodepool/CI back up [17:03:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:03:53] 06Operations, 10netops: netops: switch all subnets to use install1002/2002 as DHCP - https://phabricator.wikimedia.org/T156109#3002420 (10Dzahn) a:05akosiaris>03Dzahn [17:05:22] 06Operations, 10Analytics, 10DBA: Prep to decommission old dbstore hosts (db1046, db1047) - https://phabricator.wikimedia.org/T156844#3002434 (10Ottomata) Ok, talked with Analytics folks about this. 
These boxes host lots more than just EventLogging data, and we won't be able to wean people off of MySQL for... [17:10:01] !log restbase start deploy of ea980cc5 [17:10:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:11:51] PROBLEM - Disk space on labtestnet2001 is CRITICAL: DISK CRITICAL - free space: / 350 MB (3% inode=80%) [17:12:06] 06Operations, 03Scap3: Package + deploy new version of git-fat - https://phabricator.wikimedia.org/T155856#3002474 (10thcipriani) Is there anything on the releng side we need to do to push this forward? @Ottomata are you the right person to bother? :) For context this will likely solve {T147856} and (probabl... [17:12:08] (03PS10) 10Rush: nodepool: track and alert on age of instance states [puppet] - 10https://gerrit.wikimedia.org/r/335373 [17:12:47] (03PS11) 10Rush: nodepool: track and alert on age of instance states [puppet] - 10https://gerrit.wikimedia.org/r/335373 [17:21:21] PROBLEM - Check systemd state on cp3032 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [17:22:51] RECOVERY - puppet last run on ms-be1004 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures [17:26:22] 06Operations: Enhance account handling (meta bug) - https://phabricator.wikimedia.org/T142815#2547150 (10chasemp) Is there a way to disable an account in LDAP that would then fail for authentication for all ancillary services that check LDAP? We have run into wanting this a few times with spammers who hit Phab...
[17:27:24] 06Operations, 10DBA, 10Gerrit, 06Release-Engineering-Team, 13Patch-For-Review: Gerrit shows HTTP 500 error when pasting extended unicode characters - https://phabricator.wikimedia.org/T145885#3002514 (10demon) [17:29:13] 06Operations, 10DBA, 06Labs, 10Labs-Infrastructure: labsdb1005 (mysql) maintenance for reimage - https://phabricator.wikimedia.org/T157358#3002516 (10jcrespo) [17:29:52] 06Operations, 10DBA, 06Labs, 10Labs-Infrastructure: labsdb1006/1007 (postgresql) maintenance - https://phabricator.wikimedia.org/T157359#3002535 (10jcrespo) [17:31:21] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [17:31:33] 06Operations, 10DBA, 06Labs, 10Labs-Infrastructure: labsdb1005 (mysql) maintenance for reimage - https://phabricator.wikimedia.org/T157358#3002565 (10jcrespo) [17:32:01] 06Operations, 10DBA, 06Labs, 10Labs-Infrastructure: labsdb1005 (mysql) maintenance for reimage - https://phabricator.wikimedia.org/T157358#3002516 (10jcrespo) [17:32:15] 06Operations, 05Goal: reduce amount of remaining Ubuntu 12.04 (precise) systems - https://phabricator.wikimedia.org/T123525#3002568 (10jcrespo) [17:32:17] 06Operations, 10DBA, 06Labs, 10Labs-Infrastructure, 13Patch-For-Review: Migrate labsdb1005/1006/1007 to jessie - https://phabricator.wikimedia.org/T123731#3002567 (10jcrespo) 05stalled>03Open [17:32:32] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [17:32:51] RECOVERY - Disk space on labtestnet2001 is OK: DISK OK [17:36:21] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [17:37:46] 06Operations, 10DBA, 10Gerrit, 06Release-Engineering-Team, 13Patch-For-Review: Gerrit shows HTTP 500 error when pasting extended unicode characters - https://phabricator.wikimedia.org/T145885#3002584 (10Paladox) @Aklapper gerrit is maintained by releng. 
So the task I added is not random. [17:41:24] !log restbase end deploy of ea980cc5 [17:41:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:43:31] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [17:44:21] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [17:46:14] (03PS3) 10EBernhardson: Configure A/B test for CrossProject search results sidebar [mediawiki-config] - 10https://gerrit.wikimedia.org/r/334673 (https://phabricator.wikimedia.org/T149806) (owner: 10DCausse) [17:54:58] (03PS1) 10Tim Landscheidt: k8s: Use same logic for systemd and upstart configuration [puppet] - 10https://gerrit.wikimedia.org/r/336238 [17:55:48] (03CR) 10Tim Landscheidt: "No-op: diff -u <(git show HEAD^:modules/k8s/templates/initscripts/kube-proxy.upstart.erb) <(ruby -e "require 'erb'; @proxy_mode = 'iptable" [puppet] - 10https://gerrit.wikimedia.org/r/336238 (owner: 10Tim Landscheidt) [17:58:06] 06Operations, 10DBA, 10MediaWiki-Change-tagging: db1072 change_tag schema and dataset is not consistent - https://phabricator.wikimedia.org/T156166#3002669 (10Marostegui) [17:58:10] 06Operations, 10DBA, 13Patch-For-Review: Reimage and clone db1072 - https://phabricator.wikimedia.org/T156226#3002667 (10Marostegui) 05Open>03Resolved The BBU looks good now and has been charging and is now fully charged: ``` root@db1072:~# megacli -AdpBbuCmd -GetBbuStatus -a0 | grep -e '^isSOHGood' -e '^C... [18:00:04] gehel: Dear anthropoid, the time has come. Please deploy Weekly Wikidata query service deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170206T1800). [18:00:05] gehel: A patch you scheduled for Weekly Wikidata query service deployment window is about to be deployed. Please be available during the process.
[18:00:36] 06Operations, 10hardware-requests: Analytics AQS cluster expansion - https://phabricator.wikimedia.org/T149920#3002675 (10RobH) 05stalled>03Resolved This was resolved with hardware, setup on T155654. [18:01:56] 06Operations, 10DBA, 10Gerrit, 06Release-Engineering-Team, 13Patch-For-Review: Gerrit shows HTTP 500 error when pasting extended unicode characters - https://phabricator.wikimedia.org/T145885#3002686 (10demon) >>! In T145885#3001567, @Aklapper wrote: > @Paladox: Please do not add random projects to tasks... [18:02:08] 06Operations, 10netops: netops: switch all subnets to use install1002/2002 as DHCP - https://phabricator.wikimedia.org/T156109#3002687 (10Dzahn) a:05Dzahn>03mark [18:02:56] 06Operations, 10TimedMediaHandler, 10hardware-requests: Assign 3 more servers to video scaler duty - https://phabricator.wikimedia.org/T114337#3002690 (10RobH) a:05RobH>03None [18:03:46] nothing to deploy on wdqs this week... [18:03:52] 06Operations, 07Documentation: update ServerLifecycle page - https://phabricator.wikimedia.org/T87782#3002703 (10RobH) 05Open>03Resolved this page was updated after extensive updates from the ops offsite, I neglected to resolve this task then. [18:06:34] 06Operations, 10ops-codfw: Codfw: Missing mgmt dns for db2025-db2027 - https://phabricator.wikimedia.org/T156342#3002721 (10RobH) None of those ports are enabled with those descriptions, so seems these were never allocated. Confirmed and resolving task. 
[18:06:39] 06Operations, 10ops-codfw: Codfw: Missing mgmt dns for db2025-db2027 - https://phabricator.wikimedia.org/T156342#3002722 (10RobH) 05Open>03Resolved [18:06:56] !log Start to transfer commonswiki ibd and cfg from db1064 to labsdb1010 - https://phabricator.wikimedia.org/T153743 [18:06:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:12:43] 06Operations, 10ops-codfw, 10hardware-requests: decomission db2015 - https://phabricator.wikimedia.org/T149102#3002732 (10RobH) 05Open>03Resolved Cleared the port description; it's already disabled from earlier actions. [18:12:51] Dereckson: around? [18:12:55] 06Operations, 10ops-codfw, 10hardware-requests: decomission db2015 - https://phabricator.wikimedia.org/T149102#3002734 (10RobH) [18:16:21] PROBLEM - Redis replication status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 616 600 - REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 3337526 keys, up 98 days 9 hours - replication_delay is 616 [18:16:21] !log preparing to reimage db2050 T152188 [18:16:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:16:26] T152188: Restart pending mysql hosts with old TLS cert - https://phabricator.wikimedia.org/T152188 [18:17:21] RECOVERY - Redis replication status tcp_6479 on rdb2006 is OK: OK: REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 3334310 keys, up 98 days 9 hours - replication_delay is 0 [18:17:21] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [18:19:04] mafk: yes [18:19:24] Dereckson: https://test.wikipedia.org/wiki/File:Rsz_wikibooks-logo-hi.png <-- can you check? [18:20:13] Transparency and font quality look good [18:21:21] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [18:21:35] mafk: if you started from an SVG, could you at the same time prepare higher-resolution versions?
Target 203px (1.5x) and 270px (2x) width [18:22:03] Dereckson: original file was a PNG so I can't I'm afraid [18:22:23] I'm uploading the 'normal' logo [18:24:21] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [18:24:36] (03PS1) 10MarcoAurelio: Changing project logo for hi.wikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/336242 (https://phabricator.wikimedia.org/T157229) [18:24:50] mafk: you can ask for a SVG to tomasz (Odder) [18:26:21] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [18:31:21] PROBLEM - puppet last run on cp3032 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:31:31] PROBLEM - Check systemd state on cp3033 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [18:32:12] 06Operations, 10ops-eqiad, 13Patch-For-Review: Degraded RAID on relforge1001 - https://phabricator.wikimedia.org/T156663#3002773 (10Cmjohnson) A ticket has been created with HP support. I will update task as more information becomes available. Case ID: 5317039408 Case title: Failed Hard Drive Severity 3-N... [18:33:21] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [18:36:21] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [18:36:36] 06Operations, 06Discovery, 10Wikidata, 10Wikidata-Query-Service, and 2 others: Move data storage to /srv/wdqs/ on codfw WDQS nodes - https://phabricator.wikimedia.org/T144536#3002775 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['wdqs1001.eqiad.wmnet'] ``` and were **ALL** successful. 
[18:36:58] 06Operations, 06Discovery, 10Wikidata, 10Wikidata-Query-Service, and 2 others: Move data storage to /srv/wdqs/ on codfw WDQS nodes - https://phabricator.wikimedia.org/T144536#3002776 (10ops-monitoring-bot) Script wmf_auto_reimage was launched by gehel on neodymium.eqiad.wmnet for hosts: ``` ['wdqs1001.eqia... [18:38:31] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [18:42:51] PROBLEM - puppet last run on mc1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:43:42] 06Operations, 10ops-eqiad, 13Patch-For-Review: Rack and set up ms-fe100[5-7] - https://phabricator.wikimedia.org/T155095#3002784 (10Cmjohnson) a:05Cmjohnson>03fgiunchedi @fgiunchedi these are all yours....lmk if you have any issues. [18:45:31] RECOVERY - Check systemd state on cp3042 is OK: OK - running: The system is fully operational [18:45:31] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [18:47:26] 06Operations, 10ops-eqiad, 10fundraising-tech-ops: decommission the old pay-lvs1001/pay-lvs1002 boxes - https://phabricator.wikimedia.org/T156284#3002796 (10Jgreen) a:05Jgreen>03None [18:48:11] RECOVERY - Check systemd state on cp1051 is OK: OK - running: The system is fully operational [18:48:21] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [18:50:01] RECOVERY - Check systemd state on cp1061 is OK: OK - running: The system is fully operational [18:50:21] RECOVERY - Check systemd state on cp3032 is OK: OK - running: The system is fully operational [18:50:31] RECOVERY - Check systemd state on cp3033 is OK: OK - running: The system is fully operational [18:50:31] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [18:51:21] RECOVERY - 
Check systemd state on cp3043 is OK: OK - running: The system is fully operational [18:52:34] moritzm, are you installing packages? [18:52:40] ^or is someone? [18:52:49] jouncebot: next [18:52:49] In 0 hour(s) and 7 minute(s): Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170206T1900) [18:52:58] jynus: ema is [18:53:05] jynus: yeah that's me [18:53:29] I have systemctl complaining about a service for a package that was just uninstalled [18:53:44] (I think, it may be something else) [18:53:52] ok that should be completely unrelated [18:54:20] are you installing something related to smartd or just gmond? [18:54:21] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [18:55:01] jynus: I was not actually installing packages but rather removing varnishmedia.service which wasn't supposed to be installed on text machines any longer [18:55:08] ok [18:55:12] then this is unrelated [18:55:18] just happened to show the same error [18:55:31] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [18:55:58] jynus: yeah the alert could mention which units are failing :) [18:56:14] I think it was either a security issue [18:56:20] or difficult to parse [18:56:23] or both [18:56:33] I remember discussing that [18:56:49] and the proper way to do it was too slow for a nagios check [18:57:49] I see [18:57:51] PROBLEM - puppet last run on maps1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:58:23] 06Operations, 10ops-eqiad, 13Patch-For-Review: rack/setup prometheus100[3-4] - https://phabricator.wikimedia.org/T152504#3002859 (10Cmjohnson) a:05Cmjohnson>03Dzahn Assigning this to you.
[18:59:00] jynus: I'm off, see you tomorrow o/ [18:59:21] RECOVERY - puppet last run on cp3032 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures [19:00:04] addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Respected human, time to deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170206T1900). Please do the needful. [19:00:04] ebernhardson: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [19:01:11] RECOVERY - Check systemd state on db1074 is OK: OK - running: The system is fully operational [19:01:12] ok, old trick worked: install and uninstall [19:01:29] for some reason the unit was "loaded" despite not physically present [19:01:38] and despite reloading systemctl [19:01:44] itself [19:01:53] some kind of race condition or something [19:01:56] looks like i have the only patch, will ship it [19:02:23] or maybe just a debian file tracking issue [19:03:23] (03CR) 10EBernhardson: [C: 032] Configure A/B test for CrossProject search results sidebar [mediawiki-config] - 10https://gerrit.wikimedia.org/r/334673 (https://phabricator.wikimedia.org/T149806) (owner: 10DCausse) [19:03:30] * mafk_ present for SWAT - network outage [19:03:37] (03PS4) 10EBernhardson: Configure A/B test for CrossProject search results sidebar [mediawiki-config] - 10https://gerrit.wikimedia.org/r/334673 (https://phabricator.wikimedia.org/T149806) (owner: 10DCausse) [19:05:41] (03CR) 10EBernhardson: [C: 032] Configure A/B test for CrossProject search results sidebar [mediawiki-config] - 10https://gerrit.wikimedia.org/r/334673 (https://phabricator.wikimedia.org/T149806) (owner: 10DCausse) [19:07:11] PROBLEM - puppet last run on elastic1043 is CRITICAL: CRITICAL: Catalog fetch fail. 
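For the archives: the "install and uninstall" trick above does work, but a gentler sequence for a unit that systemd still reports as loaded after its file was removed is usually a daemon reload followed by clearing stale state. This is a sketch, not taken from the log; the unit name is the one mentioned earlier, and the error suppression is only there so the snippet is safe to run anywhere:

```shell
# Clear a unit that stays "loaded" after its file was deleted from disk.
unit=varnishmedia.service              # unit named earlier in this log
# Re-read unit files so removed units drop out of systemd's view:
systemctl daemon-reload 2>/dev/null || true
# Drop any lingering failed/loaded state for the removed unit:
systemctl reset-failed "$unit" 2>/dev/null || true
echo "cleared stale state for $unit"
```

Whether `reset-failed` alone would have sufficed here is unclear from the log; the point is only that it is worth trying before reinstalling a package.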
Either compilation failed or puppetmaster has issues [19:07:26] (03Merged) 10jenkins-bot: Configure A/B test for CrossProject search results sidebar [mediawiki-config] - 10https://gerrit.wikimedia.org/r/334673 (https://phabricator.wikimedia.org/T149806) (owner: 10DCausse) [19:07:33] (03CR) 10jenkins-bot: Configure A/B test for CrossProject search results sidebar [mediawiki-config] - 10https://gerrit.wikimedia.org/r/334673 (https://phabricator.wikimedia.org/T149806) (owner: 10DCausse) [19:08:08] !log pulled 334673 to mwdebug1002 [19:08:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:09:51] RECOVERY - puppet last run on mc1001 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures [19:09:57] (03CR) 10EBernhardson: [C: 032] Changing project logo for hi.wikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/336242 (https://phabricator.wikimedia.org/T157229) (owner: 10MarcoAurelio) [19:10:13] ebernhardson: hi, my patch :) [19:10:20] !log ebernhardson@tin Synchronized wmf-config/InitialiseSettings.php: Configure A/B test for CrossProject search results sidebar (duration: 00m 49s) [19:10:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:10:25] mafk: yup, looks safe enough to ship [19:10:41] ebernhardson: would like to test it at mwdebug1002 if possible [19:10:45] mafk: yup [19:10:48] :) [19:12:44] (03PS2) 10EBernhardson: Changing project logo for hi.wikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/336242 (https://phabricator.wikimedia.org/T157229) (owner: 10MarcoAurelio) [19:13:27] (03CR) 10Krinkle: [C: 031] Enable wgEnableWANCacheReaper in beta labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/335704 (owner: 10Aaron Schulz) [19:13:58] !log preparing to reimage db2045 T152188 [19:14:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:14:02] T152188: Restart pending mysql hosts with old TLS cert - 
https://phabricator.wikimedia.org/T152188 [19:14:35] (03CR) 10EBernhardson: [C: 032] Changing project logo for hi.wikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/336242 (https://phabricator.wikimedia.org/T157229) (owner: 10MarcoAurelio) [19:14:46] 06Operations, 10ops-eqiad, 13Patch-For-Review: Suspected faulty SSD on graphite1001 - https://phabricator.wikimedia.org/T157022#3002925 (10RobH) The replacement SSDs have arrived onsite, and planning for replacing them can take place on this task. [19:15:52] ebernhardson: maybe a full sync. of ~static/ is needed when SCAPing and not just /static/project-logos. We can test later :) [19:16:04] (03Merged) 10jenkins-bot: Changing project logo for hi.wikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/336242 (https://phabricator.wikimedia.org/T157229) (owner: 10MarcoAurelio) [19:16:54] !log pulled 336242 to mwdebug1002 [19:16:55] (03CR) 10jenkins-bot: Changing project logo for hi.wikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/336242 (https://phabricator.wikimedia.org/T157229) (owner: 10MarcoAurelio) [19:16:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:17:03] mafk: should be able to sync just the one file i would expect [19:18:06] mafk: also your code is now on mwdebug1002 [19:18:22] ebernhardson: yup, I'm checking but I can't see the change in the logo [19:18:48] (with x-wikimedia-debug on mwdebug1002 ofc) [19:19:53] hmm, the returned html still refers to wikibooks.png [19:20:43] I think I remember those changes were un-testable on debug, it's been some time since I don't do a project-logo change [19:21:03] content confirms that mwdebug1002 served the response ... 
i'm not familiar enough with the logo code to know what is supposed to be done [19:21:08] 06Operations, 06Discovery, 10Wikidata, 10Wikidata-Query-Service, and 2 others: Move data storage to /srv/wdqs/ on codfw WDQS nodes - https://phabricator.wikimedia.org/T144536#3002974 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['wdqs1001.eqiad.wmnet'] ``` and were **ALL** successful. [19:22:27] wgLogo is still showing standard wikibooks. hmm [19:22:35] maybe we can sync and see what happens [19:22:55] hmmm? [19:23:13] touching InitialiseSettings.php to encourage a re-cache of configuration didn't help. hmm [19:23:34] requesting source is Request URL: https://hi.wikibooks.org/static/images/project-logos/wikibooks.png [19:23:41] on mwdebug [19:24:00] * Dereckson looks [19:24:55] server:mwdebug1002.eqiad.wmnet [19:25:17] sigh...PEBKAC. i compared head to origin/master but didn't rebase ... sec [19:25:21] print_r($wgLogo); [19:25:21] /static/images/project-logos/wikibooks.png [19:25:37] ebernhardson: I sympathize, I had that once [19:25:46] !log re-pulled 336242 to mwdebug1002 [19:25:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:25:51] RECOVERY - puppet last run on maps1002 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures [19:25:59] wgLogo looks appropriately set now [19:26:01] it shows now [19:26:05] (git fetch on Tin, scap pull on mwdebug1002) [19:26:12] Dereckson: btw with mwrepl you just have to do `=$wgLogo` no need for print_r and such [19:26:14] (oooops forgot to rebase) [19:26:25] * Dereckson notes [19:27:37] looks good to me then [19:28:04] ok, syncing out [19:29:18] !log ebernhardson@tin Synchronized static/images/project-logos/hiwikibooks.png: First part of Changing project logo for hi.wikibooks.org (duration: 00m 39s) [19:29:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:30:16] !log ebernhardson@tin Synchronized wmf-config/InitialiseSettings.php: Second
half of changing project logo for hi.wikibooks.org (duration: 00m 39s) [19:30:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:31:59] mafk: looks right with ?debug=1, varnish has cached pages referring to the old one though which i suspect is expected [19:32:33] yep, with ?debug=1 looks good [19:32:49] ebernhardson: purgeList.php should do the rest [19:32:53] It seems to be cached on your browsers [19:32:54] a more "instant fix" might be if projects started with a custom wgLogo, but that logo was symlinked to the default until later [19:33:04] as noted in the deployments page [19:33:10] https://hi.wikibooks.org/wiki/%E0%A4%B9%E0%A4%BF%E0%A4%A8%E0%A5%8D%E0%A4%A6%E0%A5%80_%E0%A4%B5%E0%A5%8D%E0%A4%AF%E0%A4%BE%E0%A4%95%E0%A4%B0%E0%A4%A3 ah nope [19:33:14] I've one with the former logo [19:33:32] Dereckson: purgeList.php [19:33:43] which though? You have to purge all resource loader calls that include that css [19:33:50] or you can just wait for it to time-out [19:33:56] mafk: what do you want to do with it? [19:34:03] we can purge the logo URL when it changes [19:34:05] you@terbium:~$ echo "https://en.wikipedia.org/static/images/project-logos/hiwikibooks.png" | mwscript purgeList.php --wiki=hiwikibooks [19:34:09] when the logo *changes* [19:34:23] the logo was already changed... [19:34:26] I see it changed [19:34:41] mafk: purgeList.php sends to Varnish a request to drop content from the cache [19:34:56] with the line you offer, it would drop hiwikibooks.png [19:34:57] mafk: that would only help if the css already pointed to that. For that to work we would have to initialize wikis with the logo set to a per-project logo, and then symlink that custom logo to the default. Later on the symlink could be replaced and the purge would do as desired [19:35:04] hiwikibooks.png is a *new* file [19:35:18] isn't that what it's supposed to do?
the docs at wikitech say so, I'm just following the docs :) [19:36:03] No, the documentation covers the case "a file changes", here the changed file isn't the logo (wikibooks.png stays the same), but the CSS compiled from several files. [19:36:11] RECOVERY - puppet last run on elastic1043 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures [19:36:16] the logo which was previously returned by css was .../wikibooks.png, so we have to wait for all the appropriate css to time out. It looks like the cache time is only set to 5 minutes so we can just wait [19:37:02] Dereckson: so it's just waiting, okay :) [19:37:06] !log restarting db2060 after kernel upgrade [19:37:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:39:45] Dereckson and ebernhardson I think it's all done, the new logo, at least while logged in, appears everywhere now [19:40:13] mafk: yea logging in avoids most caching, unfortunately. It will fix itself for logged out users relatively soon [19:40:30] declaring SWAT complete [19:40:37] ctrl-shift-r logged out resolved it [19:40:44] as for others, matter of waiting [19:42:29] thanks for your help, both :) [19:46:32] hi I'm getting this error [19:46:33] Unable to establish a connection to any database host (while trying "phabricator_project"). All masters and replicas are completely unreachable. [19:46:36] on phabricator [19:46:57] twentyafterfour jynus ^^ [19:47:17] works now [19:47:27] try f5 [19:47:33] that usually works [19:47:42] yep [19:48:13] my f5 seems to turn down the brightness of my keyboard, anyways it works now :) [19:49:14] then function + f5 [19:49:37] it is a shortcut for refresh/reload the page, paladox [19:50:04] initiating a new http request that would return, hopefully, a 200 HTTP OK [19:50:31] jynus it's command + r on the mac it seems, i just learned that :) [19:50:47] isn't that, uncached reload? [19:50:59] discard local cache and reload?
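The purge discussion above boils down to: purging the new logo URL alone does nothing, because what Varnish is serving is the compiled site CSS that still embeds the old URL. A sketch of the two-URL purge list that would cover both (the load.php URL and its parameters are illustrative assumptions, not taken from the log):

```shell
# Purge list for a logo swap: the new image itself, plus the ResourceLoader
# CSS response that embeds the logo URL (module/parameter names assumed).
cat > /tmp/purge-urls.txt <<'EOF'
https://hi.wikibooks.org/static/images/project-logos/hiwikibooks.png
https://hi.wikibooks.org/w/load.php?lang=hi&modules=site.styles&only=styles
EOF
wc -l < /tmp/purge-urls.txt
# On terbium, the list would then be fed to the purge script (not run here):
#   mwscript purgeList.php --wiki=hiwikibooks < /tmp/purge-urls.txt
```

In practice, as noted above, the 5-minute CSS cache TTL made waiting the simpler option.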
[19:51:19] probably it depends on the browser [19:51:21] Oh, fn + f5 doesn't work. [19:51:40] which browser do you use? [19:51:45] http://support.keepandshare.com/support/solutions/articles/44733-how-can-i-refresh-my-browser- [19:51:49] jynus safari [19:51:54] buh [19:51:54] and chrome sometimes. [19:51:59] non-free stuff [19:52:08] oh, the webkit is open source. [19:54:33] 06Operations, 10Traffic, 06Wikipedia-Android-App-Backlog, 06Wikipedia-iOS-App-Backlog, and 2 others: Zero: Investigate removing the limit on carrier tagging to m-dot and zero-dot requests - https://phabricator.wikimedia.org/T137990#3003140 (10JMinor) p:05Normal>03Low [19:56:51] 06Operations, 10ops-codfw, 10DBA, 13Patch-For-Review: db2060 not accessible - https://phabricator.wikimedia.org/T156161#3003162 (10jcrespo) I have restarted MySQL on db2060, as this is probably done now. Waiting for repl. to catch up, and for Manuel to resolve if he thinks it is ok. [20:00:05] tgr, dr0ptp4kt, and bblack: Dear anthropoid, the time has come. Please deploy Zero (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170206T2000). [20:00:16] i'm here [20:00:39] (03PS2) 10Dzahn: site.pp, DHCP: remove cp3011-cp3022 [puppet] - 10https://gerrit.wikimedia.org/r/334005 (https://phabricator.wikimedia.org/T130883) [20:01:03] bblack: around? [20:03:51] PROBLEM - puppet last run on restbase1018 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:07:24] dr0ptp4kt: should we go on or is bblack absolutely needed to verify the site is running? [20:07:50] (I didn't verify the schedule other than sending out invites, so this is on me) [20:09:16] tgr: i would feel more comfortable with bblack being around in case we hit a bump. [20:11:30] tgr: b/c of this: https://github.com/wikimedia/operations-puppet/blob/31d75212a934fffd7c3dcd78b276cdc93fbdf68e/modules/varnish/files/zerofetch.py [20:11:55] and its friends [20:13:19] dr0ptp4kt: ok.
I'll find another time then (and try to communicate it better) [20:13:33] tgr: thx, much appreciated [20:15:03] tgr: somewhat [20:15:25] oh, cool [20:15:31] if zerofetch fails it just sticks with the existing data it had from last success [20:15:35] it's not super critical in that sense [20:16:41] I'll do the deployment then (bblack: it's a fairly boring change to JsonConfig, we are just being paranoid due to the possible size of the fallout) [20:17:15] what's the fallout we're worried about? [20:19:14] bblack: my main concern was that we might not fail gracefully. but it sounds like you're saying it should fail gracefully. [20:19:28] I am mostly worried because I have no clue how the thing works :) [20:19:31] PROBLEM - puppet last run on cp3006 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:22:10] (03CR) 10Dzahn: [C: 032] site.pp, DHCP: remove cp3011-cp3022 [puppet] - 10https://gerrit.wikimedia.org/r/334005 (https://phabricator.wikimedia.org/T130883) (owner: 10Dzahn) [20:25:56] the fetcher just syncs json data to a file on disk [20:26:08] it validates the data before it syncs it into place, so any failure there results in leaving the old data [20:26:15] the runtime stuff consumes that file [20:26:31] PROBLEM - puppet last run on labservices1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[/usr/local/bin/labs-ip-alias-dump.py] [20:27:01] !log cp3011 thru cp3022 - revoke puppet certs, puppet node deactivate (T130883) [20:27:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:27:06] T130883: decom cp3011-22 (12 machines) - https://phabricator.wikimedia.org/T130883 [20:27:30] if we do break the fetcher, how do we learn of it? is it wired into icinga or logstash or something like that?
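The fail-safe behaviour bblack describes (validate the fetched data, and only swap it into place on success, so a broken fetch leaves the last good copy untouched) is a classic validate-then-atomic-replace pattern. A toy sketch, with invented filenames and data; the real logic lives in zerofetch.py:

```shell
# Fetch to a temp file, validate, and only then replace the live copy.
live=$(mktemp); new=$(mktemp)
echo '{"carriers": {}}' > "$live"           # data from the last good run
echo 'not json at all' > "$new"             # simulate a broken fetch
if python3 -m json.tool "$new" >/dev/null 2>&1; then
    mv "$new" "$live"                       # atomic replace on success
    echo "synced new data"
else
    rm -f "$new"                            # discard, keep old data
    echo "invalid fetch: keeping old data"
fi
cat "$live"                                 # still the last good copy
```

Because the consumer only ever reads the live file, and `mv` within one filesystem is atomic, readers see either the complete old data or the complete new data, never a half-written file.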
[20:30:34] 06Operations, 10ops-eqiad, 13Patch-For-Review: Degraded RAID on relforge1001 - https://phabricator.wikimedia.org/T156663#3003242 (10Gehel) @Cmjohnson most probably a stupid question, but why doesn't the serial number [[ https://phabricator.wikimedia.org/T156663#2994160 | shown by smartctl ]] matches the one... [20:31:58] tgr: there's eventually a check for it, but it takes a while to notice [20:32:01] I can look manually too [20:32:51] RECOVERY - puppet last run on restbase1018 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [20:33:28] tgr: is it deployed yet? [20:34:20] bblack: few more minutes (CI was sluggish) [20:39:02] !log tgr@tin Synchronized php-1.29.0-wmf.10/extensions/JsonConfig/includes/JCUtils.php: T155532: Update JsonConfig login API call (duration: 01m 00s) [20:39:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:39:09] T155532: Zerowiki (using JsonConfig) is using deprecated login method resulting in logspam - https://phabricator.wikimedia.org/T155532 [20:39:10] bblack: ^ [20:39:31] PROBLEM - puppet last run on cp3004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:39:52] dr0ptp4kt: ^ [20:39:59] tgr: is it everyhere? [20:40:03] er, everywhere [20:40:37] sounds like the next time such a change could just ride the train, since the worst that can happen is that Zero stops syncing and there is an alarm about it [20:40:41] dr0ptp4kt: should be [20:41:03] well assuming you made zerofetch totally fail or get invalid json data [20:41:07] tgr: thx. that's mostly the case, so long as there isn't a data corruption issue in the actual app server or memcached/redis layer! 
[20:41:25] I think, it's probably possible to silently "fail" by giving it legitimately structured data that doesn't have the right info of course :) [20:42:00] the server I happen to be looking at will execute its natural cronjob in 2 minutes, so watching that [20:42:20] tgr: i'm gonna update a config and see what happens next [20:42:49] thinking about it, this change was to the API pull logic inside JsonConfig so it would fail on some lower level than the fetcher script [20:43:07] dr0ptp4kt: wait please [20:43:08] I'm a bit unclear on the whole architecture [20:43:16] dr0ptp4kt: just to see if it syncs ok with a no-op (no change to data) [20:43:27] bblack: good principle, will do that first [20:43:59] zerowiki pulls from meta, and the fetcher script pulls from zerowiki to update actual config for load balancers(?), is that correct? [20:44:30] ok got a successful no-op sync [20:44:46] (it succeeded in pulling down new json data, but compared identical to old and left it in place, and updated its last-success timestamp) [20:45:04] i just did a no-op save of the config [20:45:08] last I heard, zerowiki's data was updated on zerowiki [20:45:19] (edited there, I thought) [20:45:23] yes, these days the data are in zerowiki [20:45:38] bblack: wanna try again? [20:45:53] so what's the wiki-to-wiki data copy logic in JsonConfig used for? [20:46:00] dr0ptp4kt: just did, still ok [20:46:05] bblack: thx [20:46:08] it does something since it threw warnings [20:46:11] now i'll add an ip [20:46:18] errrr wait [20:46:22] to testproxy or something? [20:46:27] tgr: what were the warnings? [20:46:54] (i'm not yet adding the ip) [20:46:59] before the patch, I mean [20:47:05] oh [20:47:09] ok, any warnings atm? [20:48:29] RECOVERY - puppet last run on cp3006 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures [20:48:41] tgr: seeing any warnings or fatals? [20:48:50] nothing jumps out [20:49:00] tgr: thx. 
about to add ip [20:49:41] !log cp3011 thru cp3022 - shutdown / poweroff (T130883) [20:49:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:49:45] T130883: decom cp3011-22 (12 machines) - https://phabricator.wikimedia.org/T130883 [20:49:49] (03CR) 10Hashar: "Madhumitha, Yuvi pointed me to you :)" [puppet] - 10https://gerrit.wikimedia.org/r/333230 (https://phabricator.wikimedia.org/T155820) (owner: 10Hashar) [20:51:39] PROBLEM - Start a job and verify on Trusty on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/grid/start/trusty - 185 bytes in 1.252 second response time [20:51:59] bblack: would you please check now? [20:53:28] dr0ptp4kt: the data for carriers (but not proxies) changed. hard for me to tease out the diff of what changed, though. [20:53:39] RECOVERY - Start a job and verify on Trusty on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 1.025 second response time [20:54:22] ah I managed a useful diff now [20:54:29] a new IP ending in .93, added to carrier TEST1 [20:54:33] yep [20:54:46] everything seems sane then [20:54:52] how long until that propagates to all the varnishes again? up to 15 mins? [20:55:17] yes, up to 15m [20:55:36] every varnish runs the update cron once every 15 minutes, and they're all at differently-staggered timeslots [20:55:39] RECOVERY - puppet last run on labservices1001 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [20:56:05] tgr: we'll need maybe an additional 5-6 mins for this window [20:56:49] tgr: logs still good? 
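The "differently-staggered timeslots" bblack mentions, combined with a 15-minute period, explain the up-to-15-minute propagation window: each cache host runs the fetch cron at its own minute offset, so the last host to pick up a change can lag a full period behind the first. One common way to derive a stable per-host offset is to hash the hostname (the hashing scheme here is invented for illustration; puppet may compute it differently):

```shell
# Derive a stable minute offset per host for a cron running every 15 minutes,
# so the fleet's fetches are spread across the period instead of thundering.
for host in cp3030 cp3031 cp3032; do
    offset=$(( $(printf '%s' "$host" | cksum | cut -d' ' -f1) % 15 ))
    echo "$host: minutes $offset,$((offset+15)),$((offset+30)),$((offset+45))"
done
```

The offset is deterministic (same host, same slot every puppet run) but spread roughly evenly across hosts, which is exactly what a staggered fleet-wide cron wants.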
[20:57:55] dr0ptp4kt: next window is services so it shouldn't interfere with a MW deployment [20:58:16] tgr: cool [20:59:15] (03PS1) 10Hashar: Remove Gemfile.lock [puppet] - 10https://gerrit.wikimedia.org/r/336262 [20:59:19] PROBLEM - Host cp3022 is DOWN: PING CRITICAL - Packet loss = 100% [20:59:33] ^ that's a host mutante is decomming [21:00:04] gwicke, cscott, arlolra, subbu, bearND, mdholloway, halfak, and Amir1: Respected human, time to deploy Services – Parsoid / OCG / Citoid / Mobileapps / ORES / … (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170206T2100). Please do the needful. [21:00:07] (03CR) 10Hashar: "The Gemfile.lock removal is now in https://gerrit.wikimedia.org/r/#/c/336262/" [puppet] - 10https://gerrit.wikimedia.org/r/332981 (owner: 10Hashar) [21:01:00] (03PS3) 10Hashar: Gemfile: add xmlrpc for ruby 2.4 [puppet] - 10https://gerrit.wikimedia.org/r/332981 [21:03:13] dr0ptp4kt: nothing rises above the default background noise of weird hhvm errors, in any case :-/ [21:05:53] tgr: bblack it's working. dance [21:06:14] tgr: your comment reminded me of cosmic background noise. [21:07:39] PROBLEM - puppet last run on nitrogen is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [21:08:29] RECOVERY - puppet last run on cp3004 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures [21:08:31] cool, thanks both [21:08:59] I still wish we had a document explaining WTF the JsonConfig fetch code actually does for Zero [21:09:21] I donno, but [21:09:43] there was a pseudo-plan at some point in the past, to move the management of Proxies (but not Carriers) data to metawiki instead of zerowiki [21:09:52] so that the community could maintain a broader proxy database there [21:09:59] (like the XFF list in MW) [21:10:12] I don't think that plan ever went anywhere [21:12:14] https://phabricator.wikimedia.org/T89838 [21:12:16] well, JsonConfig does a fetch about once an hour, if that's not actually for anything, it would be nice to get rid of it [21:12:35] I'll just file a bug about that [21:13:39] PROBLEM - Start a job and verify on Precise on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/grid/start/precise - 185 bytes in 0.743 second response time [21:14:39] RECOVERY - Start a job and verify on Precise on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.670 second response time [21:15:34] ^ madhuvishy :D [21:15:54] chasemp: gah, patching now [21:17:10] (03CR) 10Hashar: [C: 031] "Nodepool config has been adjusted to better differentiate snapshot instances and regular instances. So all good to me now :]" [puppet] - 10https://gerrit.wikimedia.org/r/335373 (owner: 10Rush) [21:27:28] Hey, we're losing SAL history currently with this sequence of edits: [21:27:31] (cur | prev) 20:40, 18 December 2016‎ LegoFan4000 (talk | contribs)‎ . . (empty) (-486,673)‎ . . (revert revert) (undo | thank) [21:27:34] (cur | prev) 18:41, 18 December 2016‎ Luke081515 (talk | contribs)‎ . . (486,673 bytes) (+486,673)‎ . . 
(revert) (undo | thank) [21:27:37] (cur | prev) 18:35, 18 December 2016‎ LegoFan4000 (talk | contribs)‎ . . (empty) (-486,698)‎ . . (Archiving) (undo | thank) [21:27:56] I noticed we don't have an archive for end of 2016. [21:28:03] o.O [21:28:16] Dereckson: what's the story, someone editing SAL page directly w/ conflicts or something? [21:28:19] I reverted the original edit, but did not noticed his second one [21:29:39] PROBLEM - Start a job and verify on Trusty on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/grid/start/trusty - 185 bytes in 0.739 second response time [21:29:56] A vandal, probably impersonating lego.ktm cleared the August-December 2016 part of the SAL. It did that twice, the first time reverted, the second time unnoticed. [21:30:17] So we never thought early January to move the page to create an archive. [21:30:34] I think I did the archiving last year [21:30:39] RECOVERY - Start a job and verify on Trusty on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.543 second response time [21:31:10] Dereckson: 13:09 Mark asked me to ask you, heh [21:31:10] 13:09 I have a response hat is mostly a link and not meant to push you off [21:31:13] 13:09 https://wikitech.wikimedia.org/wiki/Labs_labs_labs/Bare_Metal [21:31:16] oops, bad paste [21:31:23] heh [21:31:26] i wanted to paste this [21:31:32] https://wikitech-static.wikimedia.org/wiki/Server_Admin_Log [21:31:41] Dereckson: LegoFan4000 is actually someone else from another wiki community [21:31:48] 06Operations, 10DBA, 13Patch-For-Review: Restart pending mysql hosts with old TLS cert - https://phabricator.wikimedia.org/T152188#2841012 (10ops-monitoring-bot) Script wmf_auto_reimage was launched by jynus on neodymium.eqiad.wmnet for hosts: ``` ['db2050.codfw.wmnet', 'db2045.codfw.wmnet'] ``` The log can... [21:31:52] does the SAL on "static" help? 
[21:33:39] PROBLEM - Start a job and verify on Trusty on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/grid/start/trusty - 185 bytes in 0.355 second response time [21:34:39] RECOVERY - Start a job and verify on Trusty on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 1.706 second response time [21:35:34] !log bsitzmann@tin Started deploy [mobileapps/deploy@9b42448]: Update mobileapps to 034a391 [21:35:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:35:45] !log otto@tin Started deploy [eventstreams/deploy@c938a57]: (no justification provided) [21:35:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:36:39] RECOVERY - puppet last run on nitrogen is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures [21:37:32] !log otto@tin Finished deploy [eventstreams/deploy@c938a57]: (no justification provided) (duration: 01m 47s) [21:37:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:38:58] (03Draft1) 10Paladox: Testing: do not merge [puppet] - 10https://gerrit.wikimedia.org/r/336304 [21:39:01] (03PS2) 10Paladox: Testing: do not merge [puppet] - 10https://gerrit.wikimedia.org/r/336304 [21:39:33] !log bsitzmann@tin Finished deploy [mobileapps/deploy@9b42448]: Update mobileapps to 034a391 (duration: 03m 59s) [21:39:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:40:28] Sagan: mutante: yes, we can restore the SAL page, but we should do it and the archiving in a quieter moment, when deployments of the day are done [21:40:29] (03PS1) 10Madhuvishy: toolschecker: Increase check interval for grid job submit tests [puppet] - 10https://gerrit.wikimedia.org/r/336310 [21:40:32] legoktm[NE]: ok [21:43:07] (03PS3) 10Paladox: Testing: do not merge [puppet] - 10https://gerrit.wikimedia.org/r/336304 [21:44:39] 
PROBLEM - Start a job and verify on Trusty on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/grid/start/trusty - 185 bytes in 0.286 second response time [21:45:09] (03PS4) 10Paladox: Testing: do not merge [puppet] - 10https://gerrit.wikimedia.org/r/336304 [21:45:39] PROBLEM - Start a job and verify on Precise on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/grid/start/precise - 185 bytes in 1.418 second response time [21:46:39] RECOVERY - Start a job and verify on Trusty on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.602 second response time [21:47:39] RECOVERY - Start a job and verify on Precise on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.528 second response time [21:47:52] (03CR) 10Rush: [C: 032] toolschecker: Increase check interval for grid job submit tests [puppet] - 10https://gerrit.wikimedia.org/r/336310 (owner: 10Madhuvishy) [21:51:55] (03PS3) 10Dzahn: remove cp3011-cp3022, keep mgmt [dns] - 10https://gerrit.wikimedia.org/r/334015 (https://phabricator.wikimedia.org/T130883) [21:54:12] (03PS13) 10Yuvipanda: tools: Use docker engine profile in tools builder [puppet] - 10https://gerrit.wikimedia.org/r/335299 [21:54:40] (03CR) 10Yuvipanda: [V: 032] tools: Use docker engine profile in tools builder [puppet] - 10https://gerrit.wikimedia.org/r/335299 (owner: 10Yuvipanda) [21:55:00] (03CR) 10Yuvipanda: [V: 032 C: 032] tools: Use docker engine profile in tools builder [puppet] - 10https://gerrit.wikimedia.org/r/335299 (owner: 10Yuvipanda) [21:57:44] (03PS9) 10Yuvipanda: tools: Use docker profile for k8s worker nodes [puppet] - 10https://gerrit.wikimedia.org/r/335957 [21:58:11] (03CR) 10Yuvipanda: [V: 032 C: 032] tools: Use docker profile for k8s worker nodes [puppet] - 
10https://gerrit.wikimedia.org/r/335957 (owner: 10Yuvipanda) [22:00:04] dapatrick, bawolff, and Reedy: Dear anthropoid, the time has come. Please deploy Weekly Security deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170206T2200). [22:00:10] (03PS5) 10Yuvipanda: tools: Fix puppet on docker builder hosts [puppet] - 10https://gerrit.wikimedia.org/r/335970 [22:00:19] (03CR) 10Yuvipanda: [V: 032 C: 032] tools: Fix puppet on docker builder hosts [puppet] - 10https://gerrit.wikimedia.org/r/335970 (owner: 10Yuvipanda) [22:02:19] 06Operations, 10DBA, 13Patch-For-Review: Restart pending mysql hosts with old TLS cert - https://phabricator.wikimedia.org/T152188#3003497 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['db2050.codfw.wmnet', 'db2045.codfw.wmnet'] ``` and were **ALL** successful. [22:03:01] (03PS2) 10Yuvipanda: tools: Turn on docker live-migrate for docker builder [puppet] - 10https://gerrit.wikimedia.org/r/335972 (https://phabricator.wikimedia.org/T157180) [22:03:10] (03CR) 10Yuvipanda: [V: 032 C: 032] tools: Turn on docker live-migrate for docker builder [puppet] - 10https://gerrit.wikimedia.org/r/335972 (https://phabricator.wikimedia.org/T157180) (owner: 10Yuvipanda) [22:03:57] (03PS2) 10Yuvipanda: tools: Remove unused docker related files [puppet] - 10https://gerrit.wikimedia.org/r/335974 [22:04:23] (03CR) 10Yuvipanda: [V: 032 C: 032] tools: Remove unused docker related files [puppet] - 10https://gerrit.wikimedia.org/r/335974 (owner: 10Yuvipanda) [22:06:20] (03CR) 10Dzahn: [C: 032] "all shutdown and gone from Icinga now" [dns] - 10https://gerrit.wikimedia.org/r/334015 (https://phabricator.wikimedia.org/T130883) (owner: 10Dzahn) [22:07:42] (03Draft1) 10Paladox: Allow us to disable opcache validate thing, this is needed on a development server [puppet] - 10https://gerrit.wikimedia.org/r/336329 [22:07:46] (03PS2) 10Paladox: Allow us to disable opcache validate thing, this is needed on a development server [puppet] 
- 10https://gerrit.wikimedia.org/r/336329 [22:07:55] (03PS3) 10Paladox: Allow us to disable opcache validate thing, this is needed on a development server [puppet] - 10https://gerrit.wikimedia.org/r/336329 [22:10:01] (03PS1) 10Platonides: Show again svwiki logo between 1.5x and 2x zoom [mediawiki-config] - 10https://gerrit.wikimedia.org/r/336330 (https://phabricator.wikimedia.org/T157387) [22:11:55] (03CR) 10Paladox: [C: 031] "Tested on the puppet master and seems to work :)" [puppet] - 10https://gerrit.wikimedia.org/r/336329 (owner: 10Paladox) [22:12:27] (03PS8) 10Dzahn: extdist: Linting fixes [puppet] - 10https://gerrit.wikimedia.org/r/334284 (owner: 10Juniorsys) [22:13:01] (03CR) 10Dzahn: "https://www.mediawiki.org/wiki/Gerrit/Commit_message_guidelines" [puppet] - 10https://gerrit.wikimedia.org/r/336329 (owner: 10Paladox) [22:13:56] (03PS4) 10Paladox: Phabricator: Allow us to disable opcache validate in php.ini [puppet] - 10https://gerrit.wikimedia.org/r/336329 [22:13:58] (03CR) 10Dzahn: [C: 032] extdist: Linting fixes [puppet] - 10https://gerrit.wikimedia.org/r/334284 (owner: 10Juniorsys) [22:14:59] (03CR) 10Dzahn: "@Juniorsys i'll just leave the labs-related changes (openstack, quarry, labs-modules) to labs admins" [puppet] - 10https://gerrit.wikimedia.org/r/334284 (owner: 10Juniorsys) [22:15:25] PROBLEM - puppet last run on copper is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [22:21:13] 06Operations, 10ops-esams, 10hardware-requests, 13Patch-For-Review: decom cp3011-22 (12 machines) - https://phabricator.wikimedia.org/T130883#3003529 (10Dzahn) servers are now removed from puppet, salt and DNS (except mgmt) and have been shutdown. 
physical decom at the dc can follow [22:23:23] Phabricator now works with jessie :), confirmed by https://phab-01.wmflabs.org (daemons works), and everything else :) [22:23:24] (03CR) 1020after4: [C: 031] Phabricator: Allow us to disable opcache validate in php.ini [puppet] - 10https://gerrit.wikimedia.org/r/336329 (owner: 10Paladox) [22:23:29] mutante twentyafterfour ^^ [22:25:45] 06Operations, 10ops-eqiad, 13Patch-For-Review: rack/setup prometheus100[3-4] - https://phabricator.wikimedia.org/T152504#3003530 (10Dzahn) a:05Dzahn>03None [22:26:25] PROBLEM - puppet last run on elastic1021 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [22:26:40] Could not find data item profile::docker::engine::settings in any Hiera data file and no default supplied at /etc/puppet/modules/profile/manifests/docker/engine.pp:9 on node copper.eqiad.wmnet [22:28:41] 06Operations, 10ops-eqiad, 10Phabricator, 06Release-Engineering-Team, 10hardware-requests: replacement hardware for iridium (phabricator) - https://phabricator.wikimedia.org/T156970#3003536 (10Paladox) Migrations complete for phab-01 -> phabricator (labs instance). Phabricator officially currently works... [22:32:06] 06Operations, 06Performance-Team, 05DC-Switchover-Prep-Q3-2016-17, 07Epic, 07Wikimedia-Multiple-active-datacenters: Prepare a reasonably performant warmup tool for MediaWiki caches (memcached/apc) - https://phabricator.wikimedia.org/T156922#3003542 (10Krinkle) >>! In T156922#2992486, @fgiunchedi wrote: >... [22:32:29] 06Operations, 03Scap3: Package + deploy new version of git-fat - https://phabricator.wikimedia.org/T155856#3003544 (10Ottomata) I can help! 
[22:32:43] 06Operations, 10Analytics, 03Scap3: Package + deploy new version of git-fat - https://phabricator.wikimedia.org/T155856#3003545 (10Ottomata) [22:33:44] (03CR) 10Dzahn: "looks like this broke on copper: Error 400 on SERVER: Could not find data item profile::docker::engine::settings in any Hiera data file a" [puppet] - 10https://gerrit.wikimedia.org/r/335299 (owner: 10Yuvipanda) [22:35:37] mutante thanks for finding it. I'm not sure how to fix this in the new puppet hiera coding guidelines tho. [22:36:01] I am out eating now but will take a look soon [22:36:30] yuvipanda: alright, cool [22:40:13] 06Operations, 05Goal: reduce amount of remaining Ubuntu 12.04 (precise) systems - https://phabricator.wikimedia.org/T123525#3003556 (10Dzahn) [22:40:17] 06Operations, 10DBA, 06Labs, 10Labs-Infrastructure: labsdb1006/1007 (postgresql) maintenance - https://phabricator.wikimedia.org/T157359#3003555 (10Dzahn) [22:43:29] 06Operations, 07Wikimedia-log-errors: Warning: timed out after 0.2 seconds when connecting to rdb1001.eqiad.wmnet [110]: Connection timed out - https://phabricator.wikimedia.org/T125735#3003562 (10thcipriani) p:05Normal>03High ping! Has anyone been able to look into this? This is currently one of the many... [22:43:39] (03PS5) 10Paladox: Phabricator: Allow us to disable opcache validate in php.ini [puppet] - 10https://gerrit.wikimedia.org/r/336329 [22:46:26] PROBLEM - puppet last run on cp3047 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [22:49:46] 06Operations, 10Analytics, 03Scap3: Package + deploy new version of git-fat - https://phabricator.wikimedia.org/T155856#3003573 (10thcipriani) >>! In T155856#3003544, @Ottomata wrote: > I can help! yay! Thanks in advance. Feel free to poke me in IRC if you have questions/problems/need a post-deploy checker. 
[22:52:24] 06Operations, 10ops-eqiad, 13Patch-For-Review: rack/setup prometheus100[3-4] - https://phabricator.wikimedia.org/T152504#3003575 (10Dzahn) I was just using these as random test hosts for the installer. trying on 1003 though if things changed. [22:54:25] RECOVERY - puppet last run on elastic1021 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures [22:57:28] !log Purged https://en.wikipedia.org/static/apple-touch/wikipedia.png (mwscript purgeList.php) for T152538 [22:57:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:57:32] T152538: Update apple touch icon - https://phabricator.wikimedia.org/T152538 [23:04:28] 06Operations, 10ops-eqiad, 13Patch-For-Review: rack/setup prometheus100[3-4] - https://phabricator.wikimedia.org/T152504#3003611 (10Dzahn) a:03Dzahn [23:10:12] (03PS1) 10RobH: fix icinga.wikimedia.org ssl check [puppet] - 10https://gerrit.wikimedia.org/r/336335 [23:10:35] PROBLEM - puppet last run on elastic1019 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [23:11:41] (03CR) 10Dzahn: [C: 031] fix icinga.wikimedia.org ssl check [puppet] - 10https://gerrit.wikimedia.org/r/336335 (owner: 10RobH) [23:13:05] (03CR) 10RobH: [C: 032] fix icinga.wikimedia.org ssl check [puppet] - 10https://gerrit.wikimedia.org/r/336335 (owner: 10RobH) [23:14:25] RECOVERY - puppet last run on cp3047 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [23:34:09] (03CR) 10Paladox: "> Since mariadb/mysql should still be fully compatible on the client" [debs/gerrit] - 10https://gerrit.wikimedia.org/r/336002 (https://phabricator.wikimedia.org/T145885) (owner: 10Paladox) [23:34:39] (03CR) 10Paladox: "> This should be packaged on its own .deb, not part of gerrit. 
And" [debs/gerrit] - 10https://gerrit.wikimedia.org/r/336002 (https://phabricator.wikimedia.org/T145885) (owner: 10Paladox) [23:36:35] (03CR) 10Paladox: "> This should be packaged on its own .deb, not part of gerrit. And" [debs/gerrit] - 10https://gerrit.wikimedia.org/r/336002 (https://phabricator.wikimedia.org/T145885) (owner: 10Paladox) [23:37:23] (03CR) 10Volans: "Answers inline" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/335670 (https://phabricator.wikimedia.org/T157052) (owner: 10Volans) [23:37:44] (03CR) 10Dduvall: [C: 031] "Seems reasonable for the use case you're proposing." [puppet] - 10https://gerrit.wikimedia.org/r/336262 (owner: 10Hashar) [23:38:06] PROBLEM - puppet last run on mc1019 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [23:38:36] RECOVERY - puppet last run on elastic1019 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures [23:40:55] (03PS4) 10EBernhardson: [WIP] Drop mediawiki logs in HDFS after 90 days [puppet] - 10https://gerrit.wikimedia.org/r/335140 [23:41:01] (03CR) 10jerkins-bot: [V: 04-1] [WIP] Drop mediawiki logs in HDFS after 90 days [puppet] - 10https://gerrit.wikimedia.org/r/335140 (owner: 10EBernhardson) [23:42:02] (03PS2) 10Dzahn: Silence apt rsync repo activities [puppet] - 10https://gerrit.wikimedia.org/r/336218 (https://phabricator.wikimedia.org/T132324) (owner: 10Elukey) [23:45:23] (03PS2) 10Paladox: Add mariadb-java-client [debs/gerrit] - 10https://gerrit.wikimedia.org/r/336002 (https://phabricator.wikimedia.org/T145885) [23:45:43] (03PS4) 10Paladox: Gerrit: Use the mariadb plugin instead of mysql [puppet] - 10https://gerrit.wikimedia.org/r/336003 (https://phabricator.wikimedia.org/T145885) [23:45:52] (03PS5) 10Paladox: Gerrit: Use the mariadb plugin instead of mysql [puppet] - 10https://gerrit.wikimedia.org/r/336003 (https://phabricator.wikimedia.org/T145885) [23:45:53] !log prometheus1003 - installed OS, signing puppet cert, 
initial run (T152504) [23:45:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:45:59] T152504: rack/setup prometheus100[3-4] - https://phabricator.wikimedia.org/T152504 [23:46:13] (03PS5) 10EBernhardson: [WIP] Drop mediawiki logs in HDFS after 90 days [puppet] - 10https://gerrit.wikimedia.org/r/335140 [23:46:20] (03CR) 10jerkins-bot: [V: 04-1] [WIP] Drop mediawiki logs in HDFS after 90 days [puppet] - 10https://gerrit.wikimedia.org/r/335140 (owner: 10EBernhardson) [23:47:58] 06Operations, 10Stashbot: [IDEA] Backup bot for morebots - https://phabricator.wikimedia.org/T148694#2730198 (10Krinkle) Per T156929, Adminbot has been transitioned to Stashbot. Stashbot no longer relies on the wiki to save its messages (they're saved to ElasticSearch and also exposed via a […] 06Operations, 10Stashbot: [IDEA] Backup bot for morebots - https://phabricator.wikimedia.org/T148694#2730198 (10EBernhardson) I'm not familiar with Stashbot, so perhaps my concerns are unfounded, but if we are use Elasticsearch as a primary data store I wonder if we have appropriate regular backups of the data... [23:51:19] (03CR) 10Paladox: "> This should be packaged on its own .deb, not part of gerrit. And" [debs/gerrit] - 10https://gerrit.wikimedia.org/r/336002 (https://phabricator.wikimedia.org/T145885) (owner: 10Paladox) [23:51:59] (03CR) 10Dzahn: [C: 032] Silence apt rsync repo activities [puppet] - 10https://gerrit.wikimedia.org/r/336218 (https://phabricator.wikimedia.org/T132324) (owner: 10Elukey) [23:52:31] (03CR) 10Paladox: [C: 031] "> This should be packaged on its own .deb, not part of gerrit.
And" [debs/gerrit] - 10https://gerrit.wikimedia.org/r/336002 (https://phabricator.wikimedia.org/T145885) (owner: 10Paladox) [23:52:52] (03CR) 10Paladox: [C: 031] "@Jcrespo ^^" [debs/gerrit] - 10https://gerrit.wikimedia.org/r/336002 (https://phabricator.wikimedia.org/T145885) (owner: 10Paladox) [23:52:54] (03CR) 10Dzahn: "thanks, i meant to redirect this to a log file at first, but you are right, then i just need to worry about rotating it." [puppet] - 10https://gerrit.wikimedia.org/r/336218 (https://phabricator.wikimedia.org/T132324) (owner: 10Elukey) [23:55:52] 06Operations, 10Stashbot: [IDEA] Backup bot for morebots - https://phabricator.wikimedia.org/T148694#2730198 (10bd808) >>! In T148694#3003789, @EBernhardson wrote: > I'm not familiar with Stashbot, so perhaps my concerns are unfounded, but if we are use Elasticsearch as a primary data store I wonder if we have... [23:56:16] 06Operations, 10ops-eqiad, 13Patch-For-Review: rack/setup prometheus100[3-4] - https://phabricator.wikimedia.org/T152504#3003803 (10Dzahn) prometheus1003 has an OS now and has been added to puppet and can be used prometheus1004 looks like it's still in the wrong VLAN, the public one, but it should be in pri... [23:59:47] 06Operations, 10Stashbot: [IDEA] Backup bot for morebots - https://phabricator.wikimedia.org/T148694#3003826 (10bd808) >>! In T148694#3003768, @Krinkle wrote: > Anyway, not sure how high the need for it is, but might be interesting regardless. Alternatively, perhaps stashbot could have a restricted web interfa...