[00:00:08] !log maxsem@tin Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/313158/2 (duration: 01m 57s) [00:00:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:01:12] !log maxsem@tin Synchronized wmf-config/: (no message) (duration: 00m 52s) [00:01:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:02:48] (03PS1) 10MaxSem: wmg --> wg [mediawiki-config] - 10https://gerrit.wikimedia.org/r/314206 [00:03:00] (03CR) 10MaxSem: [C: 032] wmg --> wg [mediawiki-config] - 10https://gerrit.wikimedia.org/r/314206 (owner: 10MaxSem) [00:03:28] (03Merged) 10jenkins-bot: wmg --> wg [mediawiki-config] - 10https://gerrit.wikimedia.org/r/314206 (owner: 10MaxSem) [00:05:03] !log maxsem@tin Synchronized wmf-config/mobile.php: https://gerrit.wikimedia.org/r/#/c/314206/1 (duration: 00m 49s) [00:05:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:05:54] brrr [00:20:15] RECOVERY - puppet last run on lvs3004 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [00:34:34] PROBLEM - puppet last run on mw1234 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [01:01:37] RECOVERY - puppet last run on mw1234 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [01:16:21] alvaromolina alvaro basura basura asco platonicos [01:18:32] PROBLEM - HP RAID on ms-be1022 is CRITICAL: CHECK_NRPE: Socket timeout after 40 seconds. 
[01:20:54] RECOVERY - HP RAID on ms-be1022 is OK: OK: Slot 3: OK: 2I:4:2, 2I:4:1, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller, Battery/Capacitor [01:23:55] PROBLEM - MariaDB Slave Lag: m3 on db1043 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1425.88 seconds [01:35:14] PROBLEM - check_missing_thank_yous on db1025 is CRITICAL: CRITICAL missing_thank_yous=512 [critical =500] [01:40:04] PROBLEM - check_missing_thank_yous on db1025 is CRITICAL: CRITICAL missing_thank_yous=745 [critical =500] [01:41:25] RECOVERY - MariaDB Slave Lag: m3 on db1043 is OK: OK slave_sql_lag Replication lag: 0.90 seconds [01:42:31] missing_thank_yous? [01:45:14] PROBLEM - check_missing_thank_yous on db1025 is CRITICAL: CRITICAL missing_thank_yous=617 [critical =500] [01:46:35] phabricator? [01:55:14] RECOVERY - check_missing_thank_yous on db1025 is OK: OK missing_thank_yous=0 [02:37:29] !log mwdeploy@tin scap sync-l10n completed (1.28.0-wmf.20) (duration: 13m 47s) [02:37:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [03:13:09] !log mwdeploy@tin scap sync-l10n completed (1.28.0-wmf.21) (duration: 19m 19s) [03:13:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [03:20:16] !log l10nupdate@tin ResourceLoader cache refresh completed at Wed Oct 5 03:20:16 UTC 2016 (duration 7m 7s) [03:20:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [03:38:49] [21:50] Microsoft Ripped out the new engine they were going to use and used it in Microsoft Edge instead - [21:51] Seems microsoft edge has bugs [03:39:34] in case you certainly sure that a bug is upstream in edgehtml, fyi https://developer.microsoft.com/en-us/microsoft-edge/platform/issues/ ;) [05:01:18] PROBLEM - puppet last run on snapshot1007 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [05:07:46] (03Abandoned) 1001tonythomas: Lift IP throttling for Amrita University in meta wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/313648 (owner: 1001tonythomas) [05:18:36] 06Operations, 10ContentTranslation-CXserver, 10ContentTranslation-Deployments, 10MediaWiki-extensions-ContentTranslation, and 5 others: Package apertium (and dependencies) for Jessie - https://phabricator.wikimedia.org/T107306#2691651 (10KartikMistry) [05:25:03] PROBLEM - puppet last run on analytics1032 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [05:26:22] RECOVERY - puppet last run on snapshot1007 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures [05:51:27] RECOVERY - puppet last run on analytics1032 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures [06:00:36] PROBLEM - puppet last run on db1059 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [06:01:25] PROBLEM - puppet last run on wtp1008 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [06:14:57] RECOVERY - puppet last run on db1059 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:16:52] 06Operations, 10ops-eqiad, 10DBA: db1055: degraded array - https://phabricator.wikimedia.org/T147172#2691692 (10Marostegui) Hey @Cmjohnson Looks like the disk failed again (same slot), could this be the disk bay or even worse...the controller itself? ``` Adapter #0 Enclosure Device ID: 32 Slot Number: 0... 
[06:19:55] 06Operations, 10ops-eqiad, 10DBA: db1065: Degraded RAID - https://phabricator.wikimedia.org/T147396#2691693 (10Marostegui) [06:21:26] ACKNOWLEDGEMENT - MegaRAID on db1065 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) Marostegui https://phabricator.wikimedia.org/T147396 [06:25:05] RECOVERY - puppet last run on wtp1008 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures [06:25:37] PROBLEM - Disk space on ocg1002 is CRITICAL: DISK CRITICAL - free space: / 348 MB (3% inode=58%) [06:25:46] 06Operations, 10DBA, 10MediaWiki-General-or-Unknown, 13Patch-For-Review: img_metadata queries for PDF files saturates s4 slaves - https://phabricator.wikimedia.org/T147296#2691706 (10Marostegui) Thanks @aaron - once it is pushed I will keep an eye on the graphs to see if this mitigate the spikes [06:30:49] (03PS2) 10Giuseppe Lavagetto: hiera: allow searching for the full key when using expand_path [puppet] - 10https://gerrit.wikimedia.org/r/312206 [06:32:29] (03CR) 10jenkins-bot: [V: 04-1] hiera: allow searching for the full key when using expand_path [puppet] - 10https://gerrit.wikimedia.org/r/312206 (owner: 10Giuseppe Lavagetto) [06:50:36] (03PS3) 10Giuseppe Lavagetto: hiera: allow searching for the full key when using expand_path [puppet] - 10https://gerrit.wikimedia.org/r/312206 [06:51:09] PROBLEM - puppet last run on analytics1057 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[tzdata] [06:53:51] RECOVERY - Disk space on ocg1002 is OK: DISK OK [06:59:25] 06Operations, 10ops-codfw, 10DBA: db2017 failed disk (degraded RAID) - https://phabricator.wikimedia.org/T145844#2691717 (10Marostegui) Thanks @Papaul, everything looks good now! ``` Device Present ================ Virtual Drives : 1 Degraded : 0 Offline... 
[06:59:37] 06Operations, 10ops-codfw, 10DBA: db2017 failed disk (degraded RAID) - https://phabricator.wikimedia.org/T145844#2691718 (10Marostegui) 05Open>03Resolved [07:01:01] (03PS1) 10Muehlenhoff: Migrate mediawiki-firejail-convert to mediawiki-converters.profile [puppet] - 10https://gerrit.wikimedia.org/r/314233 (https://phabricator.wikimedia.org/T145811) [07:03:36] (03CR) 10Marostegui: "This looks good to me. However, if we are in process of converting all the tables to InnoDB I would make this change until we have finishe" [puppet] - 10https://gerrit.wikimedia.org/r/313235 (https://phabricator.wikimedia.org/T146673) (owner: 10Paladox) [07:08:43] !log reimaging mw1176-mw1178 to jessie [07:08:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [07:11:47] 06Operations, 10OCG-General, 13Patch-For-Review: Tons of OCG jobs caused a massive increase in queue length - https://phabricator.wikimedia.org/T147211#2691727 (10Joe) Since the deployment of the last changes, I was looking at disk usage and I noticed that there are a ton of deleted and unclosed files creat... [07:12:01] (03CR) 10Paladox: "hi, thanks, I believe @Jcrespo converted the rest of the tables to innodb yesterday." [puppet] - 10https://gerrit.wikimedia.org/r/313235 (https://phabricator.wikimedia.org/T146673) (owner: 10Paladox) [07:12:12] (03CR) 10Paladox: "Marostegui" [puppet] - 10https://gerrit.wikimedia.org/r/313235 (https://phabricator.wikimedia.org/T146673) (owner: 10Paladox) [07:12:20] (03PS7) 10Paladox: phabricator: Reduce innodb_ft_min_token_size from 3 to 1 [puppet] - 10https://gerrit.wikimedia.org/r/313235 (https://phabricator.wikimedia.org/T146673) [07:13:53] (03CR) 10Marostegui: "If we are done with the InnoDB tables, then I would say we can go ahead and see how this improves/degrades the search service." 
[puppet] - 10https://gerrit.wikimedia.org/r/313235 (https://phabricator.wikimedia.org/T146673) (owner: 10Paladox) [07:15:49] RECOVERY - puppet last run on analytics1057 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [07:28:54] (03PS4) 10Giuseppe Lavagetto: hiera: always search for the full key [puppet] - 10https://gerrit.wikimedia.org/r/312206 [07:39:13] 06Operations, 10OTRS: Upgrade OTRS to 5.0.13 - https://phabricator.wikimedia.org/T147397#2691741 (10akosiaris) [07:39:37] 06Operations, 10OTRS: Upgrade OTRS to 5.0.13 - https://phabricator.wikimedia.org/T147397#2691755 (10akosiaris) [07:40:14] 06Operations, 10OTRS: Upgrade OTRS to 5.0.13 - https://phabricator.wikimedia.org/T147397#2691741 (10akosiaris) [07:44:18] (03PS5) 10Giuseppe Lavagetto: hiera: always search for the full key [puppet] - 10https://gerrit.wikimedia.org/r/312206 [07:47:06] (03PS2) 10Muehlenhoff: Fix quoting for br_netfilter kmod configuration [puppet] - 10https://gerrit.wikimedia.org/r/314035 [07:51:31] 06Operations, 10Prod-Kubernetes, 05Kubernetes-production-experiment, 15User-Joe: Docker installation for production kubernetes - https://phabricator.wikimedia.org/T147181#2691770 (10Joe) [07:54:52] 06Operations, 10Mobile-Content-Service, 06Parsing-Team, 06Services, and 3 others: Create functional cluster checks for all services (and have them page!) - https://phabricator.wikimedia.org/T134551#2691772 (10Joe) a:05Joe>03None [07:55:32] 06Operations, 10scap, 13Patch-For-Review, 03Scap3, 15User-mobrovac: Scap::server::sources is out of sync with the repositories actually present on tin/mira - https://phabricator.wikimedia.org/T143692#2691774 (10Joe) This is mostly done; next step would be to make scap_source verify the origin and change... 
[07:56:21] 06Operations, 06Services-next, 15User-Joe, 15User-mobrovac, 05codfw-rollout: Create a service location / discovery system for locating local/master resources easily across all WMF applications - https://phabricator.wikimedia.org/T125069#2691775 (10Joe) a:05Joe>03None [07:56:44] 06Operations, 10Wikimedia-Stream, 13Patch-For-Review: redis not up after reboot on rcs machines - https://phabricator.wikimedia.org/T130147#2691777 (10Joe) 05Open>03Resolved [07:57:43] !log reimaging mw120[01] to Debian Jessie (mw1201 is a scap proxy) [07:57:43] 06Operations, 07Puppet, 15User-Joe: Import vs autoload: the puppet parser is a bad joke that stopped being funny years ago. - https://phabricator.wikimedia.org/T119042#2691778 (10Joe) a:05Joe>03None [07:57:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [07:58:07] 06Operations, 07Puppet, 13Patch-For-Review, 15User-Joe: Kill manifests/realm.pp - https://phabricator.wikimedia.org/T85459#2691780 (10Joe) a:05Joe>03None [07:59:37] (03CR) 10Alexandros Kosiaris: [C: 031] hiera: always search for the full key [puppet] - 10https://gerrit.wikimedia.org/r/312206 (owner: 10Giuseppe Lavagetto) [08:00:04] kart_ and akosiaris: Dear anthropoid, the time has come. Please deploy Apertium migration to Jessie (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161005T0800). [08:00:04] kart_: A patch you scheduled for Apertium migration to Jessie is about to be deployed. Please be available during the process. [08:00:21] 06Operations, 10scap, 13Patch-For-Review, 03Scap3, and 2 others: Scap::server::sources is out of sync with the repositories actually present on tin/mira - https://phabricator.wikimedia.org/T143692#2691784 (10Joe) [08:00:23] kart_: I 'll start the migrateion [08:01:51] akosiaris: okay! 
[08:02:02] (03PS3) 10Alexandros Kosiaris: apertium: Enable it on SCB [puppet] - 10https://gerrit.wikimedia.org/r/310311 (https://phabricator.wikimedia.org/T147288) [08:02:06] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] apertium: Enable it on SCB [puppet] - 10https://gerrit.wikimedia.org/r/310311 (https://phabricator.wikimedia.org/T147288) (owner: 10Alexandros Kosiaris) [08:08:08] 06Operations, 10Ops-Access-Requests, 10netops: elukey - Access to network devices - https://phabricator.wikimedia.org/T147061#2691796 (10elukey) @ema also would like to get access to the network equipment, maybe we could couple both requests in one task. I'd also really like to get added to the noc email lis... [08:08:55] (03PS1) 10Marostegui: db-eqiad.php: Increase weight for db1082 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/314235 (https://phabricator.wikimedia.org/T145533) [08:10:02] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Increase weight for db1082 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/314235 (https://phabricator.wikimedia.org/T145533) (owner: 10Marostegui) [08:10:27] (03Merged) 10jenkins-bot: db-eqiad.php: Increase weight for db1082 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/314235 (https://phabricator.wikimedia.org/T145533) (owner: 10Marostegui) [08:10:37] akosiaris: https://integration.wikimedia.org/ci/job/operations-puppet-doc/26839/console - failure? [08:10:56] !log disable puppet on scb1001, scb1002, scb2001, scb2002 [08:11:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [08:11:12] !log T147288 disable puppet on scb1001, scb1002, scb2001, scb2002 [08:11:13] T147288: Migrate apertium to SCB - https://phabricator.wikimedia.org/T147288 [08:11:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [08:11:46] kart_: puppet rdoc ? ... jenkins is acting up again [08:12:10] kart_: nothing to see here... 
moving along [08:12:49] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Increase weight for db1082 from 100 to 300 (duration: 00m 52s) [08:12:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [08:14:02] PROBLEM - cxserver endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:15:03] akosiaris: okay! [08:15:17] hmm.. that ^ one though is interesting [08:15:41] <_joe_> akosiaris: that can be cxserver being slow [08:15:50] <_joe_> because apertium is slow? [08:16:36] or something has changed in the apertium responses [08:16:49] note that CXserver now uses the apertium on jessie [08:17:05] implicitly unfortunately due to the apertium.svc.eqiad.wmnet IP being on the box [08:17:26] :-( [08:19:26] /v1/mt/{from}/{to}{/provider} (Machine translate an HTML fragment using Apertium.) is CRITICAL: Could not fetch url http://10.64.0.16:8080/v1/mt/en/es/Apertium: Generic connection error: HTTPConnectionPool(host=u'10.64.0.16', port=8080): Max retries exceeded with url: /v1/mt/en/es/Apertium (Caused by ReadTimeoutError("HTTPConnectionPool(host=u'10.64.0.16', port=8080): Read timed out. (read timeout=5)",)) [08:20:35] (03CR) 10Muehlenhoff: [C: 032] Fix quoting for br_netfilter kmod configuration [puppet] - 10https://gerrit.wikimedia.org/r/314035 (owner: 10Muehlenhoff) [08:20:40] (03PS3) 10Muehlenhoff: Fix quoting for br_netfilter kmod configuration [puppet] - 10https://gerrit.wikimedia.org/r/314035 [08:20:49] !log akosiaris@puppetmaster1001 conftool action : set/pooled=no; selector: scb1001.eqiad.wmnet (tags: ['dc=eqiad', 'cluster=scb', 'service=cxserver']) [08:20:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [08:23:13] PROBLEM - puppet last run on palladium is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [08:25:03] 06Operations, 10DBA, 13Patch-For-Review: db1019: Decommission - https://phabricator.wikimedia.org/T146265#2691818 (10Marostegui) It still needs to be deleted from DNS now that I think about it. We can probably do this at the same time it gets delete from the array of hosts of db-equiad|codfw.php files. [08:28:20] !log reimaging mw1172,mw1179, mw1180 to jessie [08:28:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [08:32:56] akosiaris: are we good now? [08:35:10] kart_: no [08:35:30] for some reason I can't yet figure out swagger is failing [08:35:35] swagger checker [08:35:56] {"name":"cxserver","hostname":"scb1001","pid":218,"level":50,"levelPath":"error","msg":"MT processing error: Error: ETIMEDOUT\n at null._onTimeout (/srv/deployment/cxserver/deploy-cache/revs/97db738dfb500ef54fea8583d2bbba75b45e7fd9/node_modules/preq/node_modules/request/request.js:759:15)\n at Timer.listOnTimeout (timers.js:92:15)","time":"2016-10-05T08:30:59.559Z","v":0} [08:39:06] PROBLEM - salt-minion processes on puppetmaster1001 is CRITICAL: PROCS CRITICAL: 6 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [08:41:38] santhosh_: Nikerabbit Do you know above error^^ [08:42:52] !log installing PHP security updates on Ubuntu systems [08:42:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [08:45:03] it's clearly timing out trying to connect to apertium [08:45:17] but I can't reproduce it with a curl command ... [08:46:22] 06Operations, 05Prometheus-metrics-monitoring: Port apache httpd metrics from ganglia to prometheus - https://phabricator.wikimedia.org/T147316#2691860 (10fgiunchedi) @elukey indeed the scoreboard isn't parsed :( I took a stab at parsing it, the results look like this ``` $ curl localhost:9117/metrics -s | g... 
[08:46:26] RECOVERY - puppet last run on palladium is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures
[08:48:31] why would it time out... is the input someshow special?
[08:50:52] strange
[08:51:22] akosiaris: apertium-apy service is OK?
[08:51:50] Operations, Prod-Kubernetes, Kubernetes-production-experiment, User-Joe: Docker installation for production kubernetes - https://phabricator.wikimedia.org/T147181#2683877 (Joe) p:Triage>High a:Joe
[08:52:20] kart_: on scb1001 ? unsure.. looks like it.. it will return on all the basic checks... but those are manual.. so something might be wrong
[08:52:30] Operations, Prod-Kubernetes, Kubernetes-production-experiment: setup wmf4747/wmf4748/wmf4749/wmf4750 for temp kubernetes testing - https://phabricator.wikimedia.org/T146171#2691879 (Joe)
[08:52:49] btw, we have no downtime.. user requests are served normally. In case that was not clear
[08:52:49] Operations, Prod-Kubernetes, Kubernetes-production-experiment: setup wmf4747/wmf4748/wmf4749/wmf4750 for temp kubernetes testing - https://phabricator.wikimedia.org/T146171#2652695 (Joe) a:Joe>None
[08:52:50] ACKNOWLEDGEMENT - cassandra service on maps-test2002 is CRITICAL: CRITICAL - Expecting active but unit cassandra is failed Gehel reimage in progress
[08:52:50] ACKNOWLEDGEMENT - tileratorui on maps-test2002 is CRITICAL: Connection refused Gehel reimage in progress
[08:52:51] ACKNOWLEDGEMENT - cassandra service on maps-test2003 is CRITICAL: CRITICAL - Expecting active but unit cassandra is failed Gehel reimage in progress
[08:52:51] ACKNOWLEDGEMENT - tileratorui on maps-test2003 is CRITICAL: Connection refused Gehel reimage in progress
[08:52:52] ACKNOWLEDGEMENT - cassandra service on maps-test2004 is CRITICAL: CRITICAL - Expecting active but unit cassandra is failed Gehel reimage in progress
[08:52:52] ACKNOWLEDGEMENT - tileratorui on maps-test2004 is CRITICAL: Connection refused Gehel reimage in progress
[08:53:12] ^ sorry for the spam...
[08:53:32] Operations, RfC, User-Joe, discovery-system, services-tooling: [RFC] Define the on-disk and live structure of etcd pool data - https://phabricator.wikimedia.org/T100793#2691897 (Joe) a:Joe>None
[08:56:26] RECOVERY - cxserver endpoints health on scb1001 is OK: All endpoints are healthy
[08:56:47] akosiaris: there are some error in apertium-apy
[08:56:59] akosiaris: see service status of apertium-apy
[08:57:41] kart_: the conversion error ?
[08:57:56] yes
[08:58:09] it's 1 line and I have no idea what it means
[08:58:12] (I restarted service)
[08:58:16] yes.
[08:59:53] OK. It seems it happens in en-es package. Checking.
[09:01:27] akosiaris: got it.
[09:01:37] what is it ?
[09:01:45] akosiaris: apertium-en-es package is old. Need Jessie package.
[09:02:00] apertium-en-es 0.6.0-1.1+b2
[09:02:14] wat ?
[09:02:16] while, we need, 0.8.0+svn~57502-2+wmf1
[09:02:30] er... wat on ...
[09:02:36] that should not have happened
[09:02:50] yep. init.pp looks OK (it is in base package)
[09:02:53] apt.wikimedia.org is preferred
[09:03:00] Operations, Prod-Kubernetes, Kubernetes-production-experiment, User-Joe: Investigate ways to deploy docker to production - https://phabricator.wikimedia.org/T147402#2691932 (Joe)
[09:03:06] (Of, https://gerrit.wikimedia.org/r/#/c/308679)
[09:03:12] it should have gotten that instead of the one more mirrors.wikimedia.org
[09:03:43] akosiaris: apt show only fetches old version.
[09:04:12] kart_: do you mean apt-cache show ?
[09:04:29] yes
[09:04:39] it shows both
[09:04:42] https://apt.wikimedia.org/wikimedia/pool/main/a/apertium-en-es/ is good.
[09:05:01] my bad. I used apt show
[09:05:37] ah, did not use the -a flag
[09:05:41] In any case, we need to use newer version.
[09:05:51] akosiaris: where?
[09:05:59] apt show -a apertium-en-es
[09:06:17] I am trying to figure out why the apt.wikimedia.org version was not preferred
[09:06:23] this should not have happened
[09:06:28] yeah
[09:07:10] apertium-en-es apertium-es-pt apertium-oc-ca apertium-oc-es
[09:07:17] all of these packages exhibit the same
[09:07:19] weird...
[09:07:25] Operations, Puppet: Change behaviour of expand_path in hiera lookups. - https://phabricator.wikimedia.org/T147403#2691959 (Joe) p:Triage>Normal a:Joe
[09:08:31] akosiaris: Can we only install packages from jessie-wikimedia on scb* and check?
[09:08:43] only ?
[09:08:48] I lost you
[09:08:59] (CR) Giuseppe Lavagetto: [C: 1] "The change is effectively a noop, so technically sound." [puppet] - https://gerrit.wikimedia.org/r/312206 (owner: Giuseppe Lavagetto)
[09:09:23] Oh, we need other repo too. I mean priority to jessie-wikimedia (not apt.w.o only)
[09:09:24] PROBLEM - cxserver endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:10:00] the entirety of apt.wikimedia.org is (supposed to be?) preferred
[09:10:51] ah.. hmm
[09:10:56] this is starting to make some sense
[09:10:59] digging a bit more
[09:11:13] a found it
[09:11:17] weird...
[09:11:28] these packages were preexisting on scb1001
[09:11:46] !log repooling varnish-be-rand on cp2014 and cp1073 T147209
[09:11:47] T147209: etcd cluster has Raft Internal errors sporadically - https://phabricator.wikimedia.org/T147209
[09:11:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:12:03] ok.. that explains it.. we probably had role::sca applied on scb hosts at some point
[09:12:11] (PS6) Giuseppe Lavagetto: hiera: always search for the full key [puppet] - https://gerrit.wikimedia.org/r/312206 (https://phabricator.wikimedia.org/T147403)
[09:12:13] and apertium packages got installed.
[09:12:21] but the apt.wikimedia.org packages did not exist back then
[09:12:26] what a mess...
[09:12:30] ok cleaning up [09:12:38] akosiaris: oh :) [09:14:14] kart_: good catch btw. thanks! [09:14:23] :) [09:14:36] RECOVERY - salt-minion processes on puppetmaster1001 is OK: PROCS OK: 3 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [09:14:38] 06Operations, 07Puppet, 13Patch-For-Review, 15User-Joe: Change behaviour of expand_path in hiera lookups. - https://phabricator.wikimedia.org/T147403#2691992 (10Joe) [09:15:23] PROBLEM - puppet last run on wtp1004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:15:53] _joe_: ^ that is Error: Could not retrieve catalog from remote server: Error 400 on SERVER: Attempt to assign to a reserved variable name: 'trusted' on node wtp1004.eqiad.wmnet [09:16:53] (03CR) 10Giuseppe Lavagetto: [C: 031] MW apache: remove bits.wm.o vhost [puppet] - 10https://gerrit.wikimedia.org/r/305536 (https://phabricator.wikimedia.org/T107430) (owner: 10BBlack) [09:17:12] <_joe_> akosiaris: wat? [09:17:30] <_joe_> akosiaris: so the only explanation I have is that that was a puppetdb failure of some kind [09:17:43] transient btw... [09:17:51] \o/ :P [09:17:54] RECOVERY - puppet last run on wtp1004 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures [09:18:26] (03CR) 10Giuseppe Lavagetto: [C: 031] puppetdb: Only allow connection from puppetmasters [puppet] - 10https://gerrit.wikimedia.org/r/312513 (owner: 10Alexandros Kosiaris) [09:19:17] RECOVERY - cxserver endpoints health on scb1001 is OK: All endpoints are healthy [09:19:23] kart_: ^ ok [09:19:30] moving forward with the rest of the boxes [09:19:33] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [09:20:35] akosiaris: thanks [09:21:29] akosiaris: looks good! [09:21:49] !log enable puppet on scb1002. 
T147288 [09:21:50] T147288: Migrate apertium to SCB - https://phabricator.wikimedia.org/T147288 [09:21:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:22:04] !log akosiaris@puppetmaster1001 conftool action : set/pooled=yes; selector: scb1001.eqiad.wmnet (tags: ['dc=eqiad', 'cluster=scb', 'service=cxserver']) [09:22:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:22:37] !log akosiaris@puppetmaster1001 conftool action : set/pooled=no; selector: scb1002.eqiad.wmnet (tags: ['dc=eqiad', 'cluster=scb', 'service=cxserver']) [09:22:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:23:27] (03CR) 10Filippo Giunchedi: [C: 031] "LGTM! spot-checked a couple of swift hosts" [puppet] - 10https://gerrit.wikimedia.org/r/312206 (https://phabricator.wikimedia.org/T147403) (owner: 10Giuseppe Lavagetto) [09:24:33] PROBLEM - Misc HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] [09:25:26] (03CR) 10Jcrespo: [C: 04-1] "This is not good enough. First, it has to be tested with such a lower value on production. Second, things that are done with aria have to " [puppet] - 10https://gerrit.wikimedia.org/r/313235 (https://phabricator.wikimedia.org/T146673) (owner: 10Paladox) [09:26:17] (03CR) 10Paladox: "Ok, would we be able to do a separate patch that lowers innodb to 3 please instead of the default 4?" 
[puppet] - 10https://gerrit.wikimedia.org/r/313235 (https://phabricator.wikimedia.org/T146673) (owner: 10Paladox) [09:26:27] (03PS8) 10Giuseppe Lavagetto: PDF Render Service: Role and module [puppet] - 10https://gerrit.wikimedia.org/r/305256 (https://phabricator.wikimedia.org/T143129) (owner: 10Mobrovac) [09:29:54] !log akosiaris@puppetmaster1001 conftool action : set/pooled=yes; selector: scb1002.eqiad.wmnet (tags: ['dc=eqiad', 'cluster=scb', 'service=cxserver']) [09:30:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:30:12] <_joe_> mobrovac: ^^ should be cherry-pickable again [09:30:33] oh cool [09:30:35] grazie _joe_ [09:30:57] <_joe_> I break it I pay it :) [09:32:09] (03PS2) 10Alexandros Kosiaris: conftool: Add apertium to scb cluster [puppet] - 10https://gerrit.wikimedia.org/r/313965 (https://phabricator.wikimedia.org/T147288) [09:32:12] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] conftool: Add apertium to scb cluster [puppet] - 10https://gerrit.wikimedia.org/r/313965 (https://phabricator.wikimedia.org/T147288) (owner: 10Alexandros Kosiaris) [09:32:23] RECOVERY - Misc HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [09:36:09] (03CR) 10Alexandros Kosiaris: [C: 032] conftool: Set the apertium services on all scb nodes [puppet] - 10https://gerrit.wikimedia.org/r/313966 (https://phabricator.wikimedia.org/T147288) (owner: 10Alexandros Kosiaris) [09:36:13] (03PS2) 10Alexandros Kosiaris: conftool: Set the apertium services on all scb nodes [puppet] - 10https://gerrit.wikimedia.org/r/313966 (https://phabricator.wikimedia.org/T147288) [09:36:15] (03CR) 10Alexandros Kosiaris: [V: 032] conftool: Set the apertium services on all scb nodes [puppet] - 10https://gerrit.wikimedia.org/r/313966 (https://phabricator.wikimedia.org/T147288) (owner: 10Alexandros Kosiaris) [09:37:29] (03CR) 10Paladox: "@Jcrespo ^" [puppet] - 10https://gerrit.wikimedia.org/r/313235 
(https://phabricator.wikimedia.org/T146673) (owner: 10Paladox) [09:38:09] (03PS2) 10Alexandros Kosiaris: lvs: Migrate apertium to scb [puppet] - 10https://gerrit.wikimedia.org/r/313967 (https://phabricator.wikimedia.org/T147288) [09:38:13] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] lvs: Migrate apertium to scb [puppet] - 10https://gerrit.wikimedia.org/r/313967 (https://phabricator.wikimedia.org/T147288) (owner: 10Alexandros Kosiaris) [09:38:31] 06Operations, 10Continuous-Integration-Infrastructure, 10Monitoring, 13Patch-For-Review, 07Technical-Debt: Remove Ganglia Jenkins plugin from gallium - https://phabricator.wikimedia.org/T147065#2692065 (10hashar) 05Open>03Resolved a:03hashar @Dzahn merged the change and double checked the cleanup o... [09:39:23] !log akosiaris@puppetmaster1001 conftool action : set/pooled=yes; selector: scb1002.eqiad.wmnet (tags: ['dc=eqiad', 'cluster=scb', 'service=apertium']) [09:39:28] !log akosiaris@puppetmaster1001 conftool action : set/pooled=yes; selector: scb1001.eqiad.wmnet (tags: ['dc=eqiad', 'cluster=scb', 'service=apertium']) [09:39:41] !log akosiaris@puppetmaster1001 conftool action : set/pooled=yes; selector: scb2002.codfw.wmnet (tags: ['dc=codfw', 'cluster=scb', 'service=apertium']) [09:39:45] !log akosiaris@puppetmaster1001 conftool action : set/pooled=yes; selector: scb2001.codfw.wmnet (tags: ['dc=codfw', 'cluster=scb', 'service=apertium']) [09:39:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:39:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:40:01] !log pool all scb hosts for apertium service [09:40:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:43:57] !log restart pybal on lvs1006, lvs1009, lvs1012, lvs2006 T147288 [09:43:58] T147288: Migrate apertium to SCB - https://phabricator.wikimedia.org/T147288 [09:44:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, 
Master [09:44:05] (03CR) 10Hashar: "That did not have the expected result. Errors in /var/log/mediawiki/jobrunner.log still shows HTML/skinned output :(" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/312077 (owner: 10Hashar) [09:48:08] 06Operations, 10Prod-Kubernetes, 05Kubernetes-production-experiment, 15User-Joe: Investigate ways to deploy docker to production - https://phabricator.wikimedia.org/T147402#2692083 (10Joe) As far as our preferred option, backporting from stretch/sid goes, A very simple attempt yielded we'd need to build/ma... [09:54:09] 06Operations, 10Prod-Kubernetes, 05Kubernetes-production-experiment, 15User-Joe: Investigate ways to deploy docker to production - https://phabricator.wikimedia.org/T147402#2692087 (10Joe) Building the debian packages from the docker sources needs a recent docker version and still downloads all of its depe... [09:54:29] PROBLEM - MariaDB Slave SQL: s6 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1050, Errmsg: Error Table TO_DROP_hitcounter already exists on query. Default database: frwiki. Query: [snipped] [09:55:33] I will fix that [09:55:34] 06Operations, 10ops-eqiad, 10DBA: db1055: degraded array - https://phabricator.wikimedia.org/T147172#2692088 (10Marostegui) I have been trying to debug the issue to see if there is a disk bay or controller problem. Unfortunately the drac isn't giving much information after checking every single checkable tar... [10:01:12] 06Operations, 10Prod-Kubernetes, 05Kubernetes-production-experiment, 15User-Joe: Investigate ways to deploy docker to production - https://phabricator.wikimedia.org/T147402#2692099 (10Joe) So, as far as I can see, we could do one of the following: 1) import packages from dockerhub, or even use their own r... 
[10:02:34] akosiaris: I'm about to attempt again at reimaging bast3001 FYI, in 5 min or so [10:02:55] ACKNOWLEDGEMENT - MariaDB Slave SQL: s6 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1050, Errmsg: Error Table TO_DROP__counters already exists on query. Default database: ruwiki. Query: [snipped] Marostegui This is now fixed [10:03:13] godog: ok [10:05:48] !log restart pybal on lvs1003, lvs2003 T147288 [10:05:49] T147288: Migrate apertium to SCB - https://phabricator.wikimedia.org/T147288 [10:05:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:05:57] kart_: and we are fully done ^ [10:06:05] RECOVERY - MariaDB Slave SQL: s6 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional) [10:07:46] akosiaris: cool. Let me update cxserver now. [10:08:29] kart_: what kind of update does cxserver need ? [10:08:36] * akosiaris curious [10:08:46] aaah the new languages ? [10:10:32] (03CR) 10Alexandros Kosiaris: [C: 04-1] "stall this for 2-3 days or so" [puppet] - 10https://gerrit.wikimedia.org/r/313968 (https://phabricator.wikimedia.org/T147288) (owner: 10Alexandros Kosiaris) [10:12:08] !log reimage bast3001 with /srv partition scheme [10:12:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:12:50] akosiaris: yep [10:12:58] (03PS1) 10Gehel: osm - use osm2pgsql from jessie-backport [puppet] - 10https://gerrit.wikimedia.org/r/314244 [10:14:38] (03CR) 10Muehlenhoff: [C: 031] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/314244 (owner: 10Gehel) [10:17:14] (03CR) 10Alexandros Kosiaris: [C: 04-1] osm - use osm2pgsql from jessie-backport (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/314244 (owner: 10Gehel) [10:18:02] !log Update cxserver to 0b2c3fa (T144588) [10:18:03] T144588: Update packaging configuration for Jessie migration - https://phabricator.wikimedia.org/T144588 [10:18:07] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:19:00] (03CR) 10Muehlenhoff: osm - use osm2pgsql from jessie-backport (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/314244 (owner: 10Gehel) [10:20:04] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 3095514 keys - replication_delay is 0 [10:21:08] akosiaris: thanks a lot. [10:21:20] akosiaris: took months, but we're done \0/ [10:21:51] :-) [10:25:37] (03CR) 10Alexandros Kosiaris: osm - use osm2pgsql from jessie-backport (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/314244 (owner: 10Gehel) [10:26:28] 06Operations, 10Traffic, 10Wikimedia-Apache-configuration, 13Patch-For-Review: Sometimes apache error 503s redirect to /503.html and this redirect gets cached - https://phabricator.wikimedia.org/T109226#2692141 (10elukey) Idea to discuss: ``` !log adding mw120[01] back to the mw api live pool after reimage [10:29:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:33:32] MW api scap proxy back in business [10:34:00] (03CR) 10Muehlenhoff: osm - use osm2pgsql from jessie-backport (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/314244 (owner: 10Gehel) [10:35:21] 06Operations, 10vm-requests: EQIAD|CODFW: (2) VM request for zotero - https://phabricator.wikimedia.org/T147409#2692155 (10akosiaris) [10:36:26] 06Operations, 10vm-requests: EQIAD|CODFW: (2) VM request for zotero - https://phabricator.wikimedia.org/T147409#2692168 (10akosiaris) [10:41:30] !log reimaging mw1181-mw1183 to jessie [10:41:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:47:36] moritzm: btw I'm poking at dhcp/pxelinux on carbon and thus puppet is stopped, reimaging should work as normal tho [10:48:25] currently trying to understand why atftp says "connection refused" while trying to serve the pxelinux.cfg after ldlinux.c32 [10:49:37] ok [10:50:21] lunch & [10:58:55] 
(03CR) 10Jcrespo: "I am not sure what you mean, innodb's min token side is by default 3: http://dev.mysql.com/doc/refman/5.6/en/innodb-parameters.html#sysvar" [puppet] - 10https://gerrit.wikimedia.org/r/313235 (https://phabricator.wikimedia.org/T146673) (owner: 10Paladox) [11:15:22] 06Operations, 10DBA, 13Patch-For-Review: db1019: Decommission - https://phabricator.wikimedia.org/T146265#2692192 (10jcrespo) >>! In T146265#2688549, @Marostegui wrote: > @jcrespo can you confirm if it can be deleted from there without breaking the site as the header states? :-) Create a CR, add me as a rev... [11:17:43] 06Operations, 06Performance-Team, 10Thumbor: Figure out a way to live-debug running production thumbor processes - https://phabricator.wikimedia.org/T146143#2692193 (10Gilles) Yeah it had the same limitation when I tried locally. [11:23:45] !log reimaging mw1184-mw1186 to jessie [11:23:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:38:00] (03PS1) 10Ema: varnish: add varnishstat dstat plugin [puppet] - 10https://gerrit.wikimedia.org/r/314247 [11:39:02] (03CR) 10jenkins-bot: [V: 04-1] varnish: add varnishstat dstat plugin [puppet] - 10https://gerrit.wikimedia.org/r/314247 (owner: 10Ema) [11:40:28] (03PS2) 10Ema: varnish: add varnishstat dstat plugin [puppet] - 10https://gerrit.wikimedia.org/r/314247 [11:41:33] (03CR) 10jenkins-bot: [V: 04-1] varnish: add varnishstat dstat plugin [puppet] - 10https://gerrit.wikimedia.org/r/314247 (owner: 10Ema) [11:41:39] matanya: ugh, thanks [11:44:28] !log reimaging mw120[67] to Debian Jessie [11:44:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:50:57] (03PS3) 10Ema: varnish: add varnishstat dstat plugin [puppet] - 10https://gerrit.wikimedia.org/r/314247 [11:51:20] 06Operations, 06Performance-Team, 10Thumbor, 13Patch-For-Review: add thumbor to production infrastructure - https://phabricator.wikimedia.org/T139606#2692233 (10Gilles) [11:51:23] 
06Operations, 06Performance-Team, 10Thumbor: 'NoneType' object has no attribute 'lstrip' - https://phabricator.wikimedia.org/T145505#2692232 (10Gilles) 05Open>03Resolved [11:51:58] 06Operations, 06Performance-Team, 10Thumbor, 13Patch-For-Review: add thumbor to production infrastructure - https://phabricator.wikimedia.org/T139606#2515540 (10Gilles) [11:52:00] 06Operations, 06Performance-Team, 10Thumbor: Extremely noisy ffmpeg errors - https://phabricator.wikimedia.org/T145612#2692236 (10Gilles) 05Open>03Resolved [11:52:18] 06Operations, 06Performance-Team, 10Thumbor, 13Patch-For-Review: add thumbor to production infrastructure - https://phabricator.wikimedia.org/T139606#2515541 (10Gilles) [11:52:21] 06Operations, 06Performance-Team, 10Thumbor: 0px thumbnail requests should fail more elegantly - https://phabricator.wikimedia.org/T145614#2692238 (10Gilles) 05Open>03Resolved [11:52:31] (03CR) 10jenkins-bot: [V: 04-1] varnish: add varnishstat dstat plugin [puppet] - 10https://gerrit.wikimedia.org/r/314247 (owner: 10Ema) [11:54:18] (03PS4) 10Ema: varnish: add varnishstat dstat plugin [puppet] - 10https://gerrit.wikimedia.org/r/314247 [11:55:06] PROBLEM - salt-minion processes on puppetmaster1001 is CRITICAL: PROCS CRITICAL: 5 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [11:57:01] moritzm: could it be wmf-reimage related? 
--^ [11:57:15] 06Operations, 06Performance-Team, 10Thumbor: Thumbor times out on large files sometimes - https://phabricator.wikimedia.org/T147412#2692241 (10Gilles) [11:57:19] yeah, that's a false positive in the icinga check: [11:57:57] 06Operations, 06Performance-Team, 10Thumbor, 13Patch-For-Review: add thumbor to production infrastructure - https://phabricator.wikimedia.org/T139606#2515578 (10Gilles) [11:57:58] during long-running operations like the reimages there is a separate salt-minion process during the execution [11:58:00] 06Operations, 06Performance-Team, 10Thumbor: djvu failure for very high page number - https://phabricator.wikimedia.org/T145616#2692256 (10Gilles) 05Open>03Resolved [11:58:16] but the icinga check only expects one [11:58:23] one weird thing is that wmf-reimage has not been executed in my last four reimages [11:58:37] 06Operations, 06Performance-Team, 10Thumbor, 13Patch-For-Review: add thumbor to production infrastructure - https://phabricator.wikimedia.org/T139606#2515579 (10Gilles) [11:58:39] salt-key related? [11:58:40] 06Operations, 06Performance-Team, 10Thumbor: Thumbor SVG regexp insufficient - https://phabricator.wikimedia.org/T145618#2692258 (10Gilles) 05Open>03Resolved [11:59:17] that happened for a third of my reimages of today [11:59:27] ahh okok so not only to me [12:00:20] moritzm: do you proceed manually or use other tricks to resume the overall reimage process? 
[12:01:09] 06Operations, 06Performance-Team, 10Thumbor, 13Patch-For-Review: add thumbor to production infrastructure - https://phabricator.wikimedia.org/T139606#2692262 (10Gilles) [12:01:13] 06Operations, 06Performance-Team, 10Thumbor: Thumbor can't load source files bigger than 100MB - https://phabricator.wikimedia.org/T145768#2692261 (10Gilles) 05Open>03Resolved [12:01:46] the only thing failing is the reboot, so to fix it up, you can drop the old salt key, log into the host to start the salt-minion, ack the new salt cert, reboot and trigger a puppet run to finalise the cgroup setup [12:02:20] it fails before the reimage for me, if I log in the console I see ubuntu's login :( [12:04:35] which host? I can have a look [12:04:56] mw120[0167] [12:04:57] 06Operations, 06Performance-Team, 10Thumbor, 13Patch-For-Review: thumbor memory limits for main process and subprocesses - https://phabricator.wikimedia.org/T145623#2692271 (10Gilles) @fgiunchedi would there be an easy to track those OOM kills as a metric? I can't look at syslog myself and check if my late... [12:05:15] I am forcing a powercycle + PXE boot on 6 and 7 [12:05:42] having a look at 1200 [12:05:57] 06Operations, 10ops-eqiad, 10DBA: db1055: degraded array - https://phabricator.wikimedia.org/T147172#2692272 (10Cmjohnson) More than likely it's the disk. These are repurposed disks, I will replace again. [12:06:59] elukey: mw1200 shows a Debian prompt, maybe they only took a bit? [12:07:08] 06Operations, 10ops-eqiad, 10DBA: db1055: degraded array - https://phabricator.wikimedia.org/T147172#2692274 (10Marostegui) Thanks! 
[12:07:58] moritzm: I forced the powercycle and pxe boot, completed the reimage manually [12:08:04] (before lunch) [12:08:06] ok [12:08:37] RECOVERY - salt-minion processes on puppetmaster1001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [12:08:46] There are a few trusty systems which fail to reboot properly, that also affects the ms-fe* systems [12:09:12] I am writing a note on the etherpad for each of them [12:09:13] (03PS1) 10Filippo Giunchedi: install_server: use http/lpxelinux on install hosts [puppet] - 10https://gerrit.wikimedia.org/r/314251 [12:09:19] (03PS1) 10Marostegui: db1019 is going to be decommissioned [mediawiki-config] - 10https://gerrit.wikimedia.org/r/314252 (https://phabricator.wikimedia.org/T146265) [12:09:34] (03CR) 10Alexandros Kosiaris: [C: 032] puppetdb: Only allow connection from puppetmasters [puppet] - 10https://gerrit.wikimedia.org/r/312513 (owner: 10Alexandros Kosiaris) [12:09:40] (03PS3) 10Alexandros Kosiaris: puppetdb: Only allow connection from puppetmasters [puppet] - 10https://gerrit.wikimedia.org/r/312513 [12:09:42] (03CR) 10Alexandros Kosiaris: [V: 032] puppetdb: Only allow connection from puppetmasters [puppet] - 10https://gerrit.wikimedia.org/r/312513 (owner: 10Alexandros Kosiaris) [12:10:08] !log akosiaris@puppetmaster1001 conftool action : set/pooled=yes; selector: scb2001.codfw.wmnet (tags: ['dc=codfw', 'cluster=scb', 'service=apertium']) [12:10:33] ignore this ^ .. 
wrong bash history search [12:10:40] and was a noop anyway [12:11:21] (03PS2) 10Gehel: osm - use osm2pgsql from jessie-backport [puppet] - 10https://gerrit.wikimedia.org/r/314244 [12:12:00] PROBLEM - puppet last run on bast3001 is CRITICAL: Connection refused by host [12:12:08] PROBLEM - configured eth on bast3001 is CRITICAL: Connection refused by host [12:12:09] (03CR) 10Gehel: osm - use osm2pgsql from jessie-backport (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/314244 (owner: 10Gehel) [12:12:30] PROBLEM - dhclient process on bast3001 is CRITICAL: Connection refused by host [12:12:57] PROBLEM - DPKG on bast3001 is CRITICAL: Connection refused by host [12:13:07] PROBLEM - salt-minion processes on bast3001 is CRITICAL: Connection refused by host [12:13:31] 06Operations, 06Performance-Team, 10Thumbor, 13Patch-For-Review: add thumbor to production infrastructure - https://phabricator.wikimedia.org/T139606#2692281 (10Gilles) [12:13:34] 06Operations, 06Performance-Team, 10Thumbor: pdf failure - https://phabricator.wikimedia.org/T145617#2692280 (10Gilles) 05Open>03Resolved [12:13:37] PROBLEM - Disk space on bast3001 is CRITICAL: Timeout while attempting connection [12:13:37] PROBLEM - Check size of conntrack table on bast3001 is CRITICAL: Timeout while attempting connection [12:13:37] 06Operations, 06Performance-Team, 10Thumbor: Use intermediary high-quality JPEGs rather than PNGs for PDF thumbnailing - https://phabricator.wikimedia.org/T145637#2692282 (10Gilles) 05Open>03Resolved [12:13:40] 06Operations, 06Performance-Team, 10Thumbor, 13Patch-For-Review: add thumbor to production infrastructure - https://phabricator.wikimedia.org/T139606#2515613 (10Gilles) [12:13:48] PROBLEM - MD RAID on bast3001 is CRITICAL: Timeout while attempting connection [12:14:18] !log reedy@tin Synchronized php-1.28.0-wmf.21/extensions/UserMerge: Fix fatal when using special page (duration: 00m 50s) [12:14:23] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:14:55] (03CR) 10Alexandros Kosiaris: [C: 031] osm - use osm2pgsql from jessie-backport [puppet] - 10https://gerrit.wikimedia.org/r/314244 (owner: 10Gehel) [12:16:01] (03CR) 10Muehlenhoff: [C: 031] osm - use osm2pgsql from jessie-backport [puppet] - 10https://gerrit.wikimedia.org/r/314244 (owner: 10Gehel) [12:16:47] (03CR) 10Ema: [C: 031] Text VCL: remove synth side of Win+Chrome/41 workaround [puppet] - 10https://gerrit.wikimedia.org/r/313828 (https://phabricator.wikimedia.org/T141786) (owner: 10BBlack) [12:18:24] (03CR) 10Ema: [C: 031] "Nice!" [puppet] - 10https://gerrit.wikimedia.org/r/310557 (https://phabricator.wikimedia.org/T145659) (owner: 10Filippo Giunchedi) [12:23:49] 06Operations, 10Prod-Kubernetes, 05Kubernetes-production-experiment: setup wmf4747/wmf4748/wmf4749/wmf4750 for temp kubernetes testing - https://phabricator.wikimedia.org/T146171#2652695 (10faidon) >>! In T146171#2653839, @RobH wrote: > I went with the asset tags as hostname, since allocating our diminishing... [12:25:55] is bast3001 still being installed? [12:26:32] yep! 
[12:26:39] Filippo is working on it [12:26:43] k [12:27:39] RECOVERY - configured eth on bast3001 is OK: OK - interfaces up [12:28:00] RECOVERY - dhclient process on bast3001 is OK: PROCS OK: 0 processes with command name dhclient [12:28:27] RECOVERY - DPKG on bast3001 is OK: All packages OK [12:28:29] RECOVERY - salt-minion processes on bast3001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [12:28:58] RECOVERY - Check size of conntrack table on bast3001 is OK: OK: nf_conntrack is 0 % full [12:28:58] RECOVERY - Disk space on bast3001 is OK: DISK OK [12:29:10] 06Operations, 06Performance-Team, 10Thumbor, 13Patch-For-Review: add thumbor to production infrastructure - https://phabricator.wikimedia.org/T139606#2692329 (10Gilles) [12:29:13] 06Operations, 06Performance-Team, 10Thumbor: Archive file thumbs not working - https://phabricator.wikimedia.org/T145769#2692328 (10Gilles) 05Open>03Resolved [12:29:18] RECOVERY - MD RAID on bast3001 is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0 [12:31:01] 06Operations, 10OCG-General, 13Patch-For-Review: Tons of OCG jobs caused a massive increase in queue length - https://phabricator.wikimedia.org/T147211#2692330 (10cscott) I'm guessing that's most likely caused by the actions I took to clear the queue pre-deploy. But I'll take a look. [12:32:57] PROBLEM - puppet last run on bast3001 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 1 minute ago with 2 failures. 
Failed resources (up to 3 shown): Service[ganglia-monitor-aggregator@3041.service],Service[ganglia-monitor-aggregator@3008.service] [12:34:19] 06Operations, 10Ops-Access-Requests, 10netops: Give elukey/ema access to network devices - https://phabricator.wikimedia.org/T147061#2692344 (10faidon) [12:34:25] 06Operations, 06Performance-Team, 10Thumbor: Figure out a way to live-debug running production thumbor processes - https://phabricator.wikimedia.org/T146143#2692345 (10Gilles) @fgiunchedi is there any way I could get rights to access those temp folders and the manhole files inside of it? Even if I guess the... [12:38:00] 06Operations, 06Performance-Team, 10Thumbor: thumbor imagemagick filling up /tmp on thumbor1002 - https://phabricator.wikimedia.org/T145878#2692348 (10Gilles) @fgiunchedi for this situation as well, it would be easier for me to debug if I was able to run 'du' on the temp folders, but my user can't. [12:39:16] 06Operations, 06Performance-Team, 10Thumbor: thumbor ffmpeg pipe deadlock - https://phabricator.wikimedia.org/T145626#2692350 (10Gilles) @fgiunchedi what symptoms/commands could I run to figure out if this is happening again? 
[12:39:42] 06Operations, 06Performance-Team, 10Thumbor, 13Patch-For-Review: add thumbor to production infrastructure - https://phabricator.wikimedia.org/T139606#2692352 (10Gilles) [12:40:00] (03PS3) 10Gehel: osm - use osm2pgsql from jessie-backport [puppet] - 10https://gerrit.wikimedia.org/r/314244 [12:40:37] 06Operations, 13Patch-For-Review: Tracking and Reducing cron-spam from root@ - https://phabricator.wikimedia.org/T132324#2692356 (10elukey) [12:40:39] 06Operations, 06Labs: cronspam from labscontrol1001, labstore1001, labnet1002.eqiad.wmnet, labsdb1003.eqiad.wmnet - https://phabricator.wikimedia.org/T132422#2692354 (10elukey) 05Open>03Resolved a:03elukey [12:40:54] 06Operations, 13Patch-For-Review: graphite-web cronspam - https://phabricator.wikimedia.org/T144797#2692357 (10elukey) 05Open>03Resolved a:03elukey [12:40:56] 06Operations, 13Patch-For-Review: Tracking and Reducing cron-spam from root@ - https://phabricator.wikimedia.org/T132324#2194449 (10elukey) [12:41:20] 06Operations: Tracking and Reducing cron-spam from root@ - https://phabricator.wikimedia.org/T132324#2194460 (10elukey) [12:41:25] (03CR) 10Gehel: [C: 032] osm - use osm2pgsql from jessie-backport [puppet] - 10https://gerrit.wikimedia.org/r/314244 (owner: 10Gehel) [12:43:10] PROBLEM - HP RAID on ms-be1026 is CRITICAL: CHECK_NRPE: Socket timeout after 40 seconds. 
[12:44:01] 06Operations, 10Traffic, 10Wikimedia-Apache-configuration, 13Patch-For-Review: Sometimes apache error 503s redirect to /503.html and this redirect gets cached - https://phabricator.wikimedia.org/T109226#2692380 (10BBlack) Seems like it's worth testing :) [12:45:46] RECOVERY - HP RAID on ms-be1026 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller, Battery/Capacitor [12:46:45] RECOVERY - puppet last run on bast3001 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures [12:48:37] 06Operations, 06Performance-Team, 10Thumbor: Some video files not recognized - https://phabricator.wikimedia.org/T147417#2692398 (10Gilles) [12:49:03] 06Operations, 06Performance-Team, 10Thumbor: Thumbor times out on large files sometimes - https://phabricator.wikimedia.org/T147412#2692413 (10Gilles) a:05fgiunchedi>03Gilles [12:49:09] 06Operations, 06Performance-Team, 10Thumbor: Some video files not recognized - https://phabricator.wikimedia.org/T147417#2692414 (10Gilles) a:05fgiunchedi>03Gilles [12:50:39] !log dropping views jamwiki_p.abuse_filter_history drop view adywiki_p.abuse_filter_history - T147413 [12:50:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:51:38] PROBLEM - puppet last run on thumbor1002 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [12:52:47] hashar_: looks like there is nothing for eu swat today, I am going to get lunch then :) [12:53:13] (03PS1) 10Faidon Liambotis: Drain ulsfo for network maintenance [dns] - 10https://gerrit.wikimedia.org/r/314253 [12:53:26] zeljkof: happy meal : [12:53:26] ) [12:54:09] (03CR) 10Faidon Liambotis: [C: 032] Drain ulsfo for network maintenance [dns] - 10https://gerrit.wikimedia.org/r/314253 (owner: 10Faidon Liambotis) [12:55:00] 06Operations, 06Performance-Team, 10Thumbor, 13Patch-For-Review: add thumbor to production infrastructure - https://phabricator.wikimedia.org/T139606#2692427 (10Gilles) [12:55:03] 06Operations, 06Performance-Team, 10Thumbor: VIPS engine should generate JPG when dealing with TIFFs and not have the IM engine read it - https://phabricator.wikimedia.org/T145638#2692426 (10Gilles) 05Open>03Resolved [12:56:28] (03CR) 10Mobrovac: [C: 031] "LGTM. @Dzahn, feel free to merge, it will get picked up on the next deploy / restart." 
[puppet] - 10https://gerrit.wikimedia.org/r/312808 (https://phabricator.wikimedia.org/T146612) (owner: 10MarcoAurelio) [12:57:00] (03PS1) 10Gehel: osm - fixing dependency cycles [puppet] - 10https://gerrit.wikimedia.org/r/314254 [12:58:20] (03CR) 10Mobrovac: [WIP]: Cassandra TWCS deploy repository (031 comment) [software/cassandra-twcs] - 10https://gerrit.wikimedia.org/r/313825 (https://phabricator.wikimedia.org/T133395) (owner: 10Eevans) [12:59:09] (03PS2) 10Gehel: osm - fixing dependency cycles [puppet] - 10https://gerrit.wikimedia.org/r/314254 [12:59:55] (03PS1) 10DCausse: Enable subphrases autocomplete on wikisources, mw.org and wikitech [mediawiki-config] - 10https://gerrit.wikimedia.org/r/314255 (https://phabricator.wikimedia.org/T146208) [13:00:04] addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Respected human, time to deploy European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161005T1300). Please do the needful. 
[13:01:15] (03PS1) 10Gilles: Increase HTTP_LOADER_REQUEST_TIMEOUT for Thumbor [puppet] - 10https://gerrit.wikimedia.org/r/314256 [13:01:41] (03CR) 10Gehel: [C: 032] osm - fixing dependency cycles [puppet] - 10https://gerrit.wikimedia.org/r/314254 (owner: 10Gehel) [13:02:51] (03PS1) 10DCausse: Activate subphrases autocomplete on wikisources, mw.org and wikitech [mediawiki-config] - 10https://gerrit.wikimedia.org/r/314257 (https://phabricator.wikimedia.org/T146208) [13:03:35] (03CR) 10DCausse: [C: 04-1] "Should be activated only once the FSTs have been built" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/314257 (https://phabricator.wikimedia.org/T146208) (owner: 10DCausse) [13:06:49] 06Operations, 10Ops-Access-Requests, 10netops: Give elukey/ema access to network devices - https://phabricator.wikimedia.org/T147061#2692470 (10faidon) 05Open>03Resolved Done for both of you, on all 30 devices (11 core routers, 12 access switches, 4 management routers, 2 management switches, 1 peering sw... 
[13:07:43] !log upgrading JunOS on cr2-ulsfo [13:07:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:09:49] (03CR) 10Mobrovac: [C: 031] "LGTM and PCC looking good too - https://puppet-compiler.wmflabs.org/4213/" [puppet] - 10https://gerrit.wikimedia.org/r/313892 (https://phabricator.wikimedia.org/T133395) (owner: 10Eevans) [13:11:08] (03Abandoned) 10DCausse: Add initial rescore profiles for wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/257607 (https://phabricator.wikimedia.org/T110648) (owner: 10DCausse) [13:14:21] PROBLEM - Host cr2-ulsfo is DOWN: CRITICAL - Network Unreachable (198.35.26.193) [13:16:01] RECOVERY - puppet last run on thumbor1002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [13:17:11] PROBLEM - Router interfaces on cr1-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 56, down: 3, dormant: 0, excluded: 0, unused: 0BRxe-0/0/0: down - Core: cr2-ulsfo:xe-0/0/0 [10Gbps DF]BRxe-1/0/0: down - Core: cr2-ulsfo:xe1/0/0 [10Gbps DF]BRae0: down - Core: cr2-ulsfo:ae0BR [13:17:34] !log upgrading kernel packages on cp* cache hosts (no reboots yet) [13:17:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:18:31] PROBLEM - Host cr2-ulsfo IPv6 is DOWN: /bin/ping6 -n -U -w 15 -c 5 2620:0:863:ffff::2 [13:19:30] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 120, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-5/0/2: down - Core: cr2-ulsfo:xe-1/3/0 (Zayo, OGYX/124337//ZYO, 38.8ms) {#11541} [10Gbps wave]BR [13:20:48] 06Operations, 10Prod-Kubernetes, 05Kubernetes-production-experiment: Build etcd clusters to support Kubernetes and calico - https://phabricator.wikimedia.org/T147421#2692508 (10Joe) [13:21:35] (03PS1) 10Elukey: Refactor memcached role to allow a more flexible hieradata config [puppet] - 10https://gerrit.wikimedia.org/r/314260 
(https://phabricator.wikimedia.org/T129963) [13:24:05] 06Operations, 10Prod-Kubernetes, 05Kubernetes-production-experiment: Build etcd clusters to support Kubernetes and calico - https://phabricator.wikimedia.org/T147421#2692523 (10Joe) Calico is able to talk to the etcd cluster using client-side certificates: https://github.com/projectcalico/calico-containers/... [13:24:29] 06Operations, 10Prod-Kubernetes, 05Kubernetes-production-experiment, 15User-Joe: Build etcd clusters to support Kubernetes and calico - https://phabricator.wikimedia.org/T147421#2692537 (10Joe) [13:24:42] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 122, down: 0, dormant: 0, excluded: 0, unused: 0 [13:24:50] RECOVERY - Host cr2-ulsfo is UP: PING OK - Packet loss = 0%, RTA = 81.58 ms [13:24:54] (03PS2) 10Marostegui: mariadb: Decommission db1019 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/314252 (https://phabricator.wikimedia.org/T146265) [13:25:12] RECOVERY - Router interfaces on cr1-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 68, down: 0, dormant: 0, excluded: 0, unused: 0 [13:25:20] RECOVERY - Host cr2-ulsfo IPv6 is UP: PING OK - Packet loss = 0%, RTA = 79.20 ms [13:26:31] (03CR) 10Jcrespo: [C: 031] mariadb: Decommission db1019 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/314252 (https://phabricator.wikimedia.org/T146265) (owner: 10Marostegui) [13:29:41] (03CR) 10Marostegui: [C: 032] mariadb: Decommission db1019 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/314252 (https://phabricator.wikimedia.org/T146265) (owner: 10Marostegui) [13:30:04] (03PS2) 10Filippo Giunchedi: Add key for gilles' new laptop [puppet] - 10https://gerrit.wikimedia.org/r/313199 (owner: 10Gilles) [13:30:09] (03Merged) 10jenkins-bot: mariadb: Decommission db1019 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/314252 (https://phabricator.wikimedia.org/T146265) (owner: 10Marostegui) [13:31:12] PROBLEM - puppet last run on cp4009 is CRITICAL: 
CRITICAL: Puppet has 4 failures. Last run 5 minutes ago with 4 failures. Failed resources (up to 3 shown): File[/usr/lib/nagios/plugins/check-fresh-files-in-dir.py],File[/usr/local/bin/puppet-enabled],File[/usr/lib/nagios/plugins/check_sysctl],File[/etc/sysctl.d] [13:31:26] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "Minor comment, but good idea in general" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/314260 (https://phabricator.wikimedia.org/T129963) (owner: 10Elukey) [13:31:48] (03CR) 10Filippo Giunchedi: [C: 032] "verified on hangout" [puppet] - 10https://gerrit.wikimedia.org/r/313199 (owner: 10Gilles) [13:32:02] !log upgrading JunOS on cr2-ulsfo (attempt 2) [13:32:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:32:38] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: wmf-config/db-codfw.php Remove db1019 entries as it is going to be decommissioned - T146265 (duration: 00m 49s) [13:32:40] T146265: db1019: Decommission - https://phabricator.wikimedia.org/T146265 [13:32:40] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo is CRITICAL: CRITICAL - failed 32 probes of 244 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [13:32:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:33:02] PROBLEM - IPv4 ping to ulsfo on ripe-atlas-ulsfo is CRITICAL: CRITICAL - failed 273 probes of 420 (alerts on 19) - https://atlas.ripe.net/measurements/1791307/#!map [13:33:25] (03PS3) 10Andrew Bogott: Move toollabs node classes to roles. 
[puppet] - 10https://gerrit.wikimedia.org/r/314180 (https://phabricator.wikimedia.org/T147233) [13:33:27] (03PS1) 10Andrew Bogott: Added role::toollabs:legacy [puppet] - 10https://gerrit.wikimedia.org/r/314261 (https://phabricator.wikimedia.org/T147233) [13:33:50] (03PS1) 10DCausse: [WIP] Adjust shard & replica count for enwiki and dewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/314262 [13:34:08] (03PS2) 10Gehel: wdqs LVS DNS entries [dns] - 10https://gerrit.wikimedia.org/r/312216 (https://phabricator.wikimedia.org/T132457) [13:36:37] (03CR) 10Gehel: [C: 032] wdqs LVS DNS entries [dns] - 10https://gerrit.wikimedia.org/r/312216 (https://phabricator.wikimedia.org/T132457) (owner: 10Gehel) [13:39:16] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo is OK: OK - failed 5 probes of 244 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [13:39:57] (03PS2) 10Filippo Giunchedi: prometheus: generate varnish targets from conftool [puppet] - 10https://gerrit.wikimedia.org/r/310819 [13:40:08] ema: ^ [13:42:04] !log upgrading neodymium to Linux 4.4 [13:42:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:42:19] (03PS3) 10Gehel: wdqs - LVS configuration [puppet] - 10https://gerrit.wikimedia.org/r/312223 (https://phabricator.wikimedia.org/T132457) [13:44:35] (03CR) 10Elukey: Refactor memcached role to allow a more flexible hieradata config (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/314260 (https://phabricator.wikimedia.org/T129963) (owner: 10Elukey) [13:46:01] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 623 600 - REDIS on 10.192.48.44:6479 has 1 databases (db0) with 3102399 keys - replication_delay is 623 [13:46:18] !log deploying new LVS configuration for WDQS service - T132457 [13:46:19] T132457: Move wdqs to an LVS service - https://phabricator.wikimedia.org/T132457 [13:46:21] PROBLEM - Host cr2-ulsfo is DOWN: CRITICAL - Network Unreachable 
(198.35.26.193) [13:46:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:46:27] (03PS2) 10Filippo Giunchedi: Increase HTTP_LOADER_REQUEST_TIMEOUT for Thumbor [puppet] - 10https://gerrit.wikimedia.org/r/314256 (owner: 10Gilles) [13:47:12] (03PS2) 10DCausse: Initialize subphrases autocomplete on wikisources, mw.org and wikitech [mediawiki-config] - 10https://gerrit.wikimedia.org/r/314255 (https://phabricator.wikimedia.org/T146208) [13:47:19] (03PS2) 10DCausse: Activate subphrases autocomplete on wikisources, mw.org and wikitech [mediawiki-config] - 10https://gerrit.wikimedia.org/r/314257 (https://phabricator.wikimedia.org/T146208) [13:47:28] !log adding mw120[67] back to the api appservers live pool after reimage [13:47:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:48:30] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 120, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-5/0/2: down - Core: cr2-ulsfo:xe-1/3/0 (Zayo, OGYX/124337//ZYO, 38.8ms) {#11541} [10Gbps wave]BR [13:48:31] PROBLEM - Host cr2-ulsfo IPv6 is DOWN: /bin/ping6 -n -U -w 15 -c 5 2620:0:863:ffff::2 [13:48:34] (03CR) 10Filippo Giunchedi: [C: 032] Increase HTTP_LOADER_REQUEST_TIMEOUT for Thumbor [puppet] - 10https://gerrit.wikimedia.org/r/314256 (owner: 10Gilles) [13:49:00] PROBLEM - Router interfaces on cr1-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 56, down: 3, dormant: 0, excluded: 0, unused: 0BRxe-0/0/0: down - Core: cr2-ulsfo:xe-0/0/0 [10Gbps DF]BRxe-1/0/0: down - Core: cr2-ulsfo:xe1/0/0 [10Gbps DF]BRae0: down - Core: cr2-ulsfo:ae0BR [13:51:06] 06Operations, 13Patch-For-Review, 05Prometheus-metrics-monitoring: Port application-specific metrics from ganglia to prometheus - https://phabricator.wikimedia.org/T145659#2692575 (10fgiunchedi) p:05Triage>03Normal [13:52:50] RECOVERY - IPv4 ping to ulsfo on ripe-atlas-ulsfo is OK: OK - failed 1 probes of 
420 (alerts on 19) - https://atlas.ripe.net/measurements/1791307/#!map [13:53:10] bblack, _joe_: I have an error in pybal logs on lvs1006 after deploying https://gerrit.wikimedia.org/r/#/c/312223 (exceptions.ValueError: Value of arguments is not a string or stringlist) [13:53:37] I have absolutely no idea what this is about (though I suspect something between chair and keyboard on my side) [13:54:45] RECOVERY - puppet last run on cp4009 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [13:54:46] gehel: looking [13:54:52] bblack: thanks! [13:54:54] 06Operations, 05Prometheus-metrics-monitoring: Port HHVM metrics from ganglia to prometheus - https://phabricator.wikimedia.org/T147423#2692577 (10fgiunchedi) [13:55:53] gehel: which have you puppeted->pybal-restarted since deploy? [13:56:06] bblack: only lvs1006 [13:56:10] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 122, down: 0, dormant: 0, excluded: 0, unused: 0 [13:56:21] RECOVERY - Host cr2-ulsfo is UP: PING OK - Packet loss = 0%, RTA = 79.20 ms [13:56:26] (03CR) 10Andrew Bogott: [C: 032] Added role::toollabs:legacy [puppet] - 10https://gerrit.wikimedia.org/r/314261 (https://phabricator.wikimedia.org/T147233) (owner: 10Andrew Bogott) [13:56:32] (03PS2) 10Andrew Bogott: Added role::toollabs:legacy [puppet] - 10https://gerrit.wikimedia.org/r/314261 (https://phabricator.wikimedia.org/T147233) [13:56:40] RECOVERY - Router interfaces on cr1-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 68, down: 0, dormant: 0, excluded: 0, unused: 0 [13:57:28] (03CR) 10Eevans: [WIP]: Cassandra TWCS deploy repository (031 comment) [software/cassandra-twcs] - 10https://gerrit.wikimedia.org/r/313825 (https://phabricator.wikimedia.org/T133395) (owner: 10Eevans) [13:57:44] (03PS6) 10Eevans: Cassandra TWCS deploy repository [software/cassandra-twcs] - 10https://gerrit.wikimedia.org/r/313825 (https://phabricator.wikimedia.org/T133395) [13:57:58] !log upgrading 
JunOS on cr1-ulsfo [13:58:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:58:27] bblack: I'm stupid... I did not actually merge the change, so I actually just restarted pybal on lvs1006 (with no puppet changes) and I'm seeing those errors. [13:58:50] ok [13:59:13] bblack: still, should I worry about it? [13:59:16] well, we have known issues with the current pybal etcd code, maybe note that one to add to existing tickets if there isn't one about it [14:00:16] bblack: ok, I'll look. I'm not gong to provide much context... but at least I can create a ticket. [14:00:24] gehel: I'm assuming you're only looking to restart pybal on affected nodes (lvs[12]006, then some minutes later lvs[12]003) [14:00:41] PROBLEM - puppet last run on cp4016 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:01:54] RECOVERY - Host cr2-ulsfo IPv6 is UP: PING OK - Packet loss = 0%, RTA = 79.56 ms [14:01:58] 06Operations, 05Prometheus-metrics-monitoring: Port varnish metrics from ganglia to prometheus - https://phabricator.wikimedia.org/T147424#2692598 (10fgiunchedi) [14:02:00] bblack: I was planning on lvs10[(06|09|12), then lvs1003, then codfw with lvs2006 then lvs2003 [14:02:23] !log rebooting eeden (ns2.wikimedia.org) [14:02:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:02:51] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo is CRITICAL: CRITICAL - failed 134 probes of 244 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [14:02:55] (03PS3) 10Filippo Giunchedi: prometheus: generate varnish targets from conftool [puppet] - 10https://gerrit.wikimedia.org/r/310819 (https://phabricator.wikimedia.org/T147424) [14:03:21] PROBLEM - puppet last run on cp4010 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [14:04:43] PROBLEM - IPv4 ping to ulsfo on ripe-atlas-ulsfo is CRITICAL: CRITICAL - failed 325 probes of 420 (alerts on 19) - https://atlas.ripe.net/measurements/1791307/#!map [14:04:43] PROBLEM - puppet last run on cp4018 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 5 minutes ago with 3 failures. Failed resources (up to 3 shown): Package[tcpdump],Package[tshark],Package[tmux] [14:05:17] PROBLEM - puppet last run on bast4001 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 5 minutes ago with 2 failures. Failed resources (up to 3 shown): Package[tcpdump],Package[tmux] [14:05:40] PROBLEM - Host eeden is DOWN: PING CRITICAL - Packet loss = 100% [14:05:41] PROBLEM - puppet last run on cp4013 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 7 minutes ago with 3 failures. Failed resources (up to 3 shown): Package[tcpdump],Package[tshark],Package[tmux] [14:07:01] 06Operations, 10Pybal: pybal error "exceptions.ValueError: Value of arguments is not a string or stringlist" - https://phabricator.wikimedia.org/T147425#2692618 (10Gehel) [14:07:16] (03CR) 10Gehel: [C: 032] wdqs - LVS configuration [puppet] - 10https://gerrit.wikimedia.org/r/312223 (https://phabricator.wikimedia.org/T132457) (owner: 10Gehel) [14:07:22] (03PS4) 10Gehel: wdqs - LVS configuration [puppet] - 10https://gerrit.wikimedia.org/r/312223 (https://phabricator.wikimedia.org/T132457) [14:07:35] (03PS1) 10Muehlenhoff: Add retroactively assigned CVE IDs to already released patches [debs/linux44] - 10https://gerrit.wikimedia.org/r/314266 [14:07:53] RECOVERY - Host eeden is UP: PING OK - Packet loss = 0%, RTA = 85.28 ms [14:09:21] (03CR) 10Muehlenhoff: [C: 032] Add retroactively assigned CVE IDs to already released patches [debs/linux44] - 10https://gerrit.wikimedia.org/r/314266 (owner: 10Muehlenhoff) [14:09:22] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo is OK: OK - failed 10 probes of 244 (alerts on 19) - 
https://atlas.ripe.net/measurements/1791309/#!map [14:11:24] (03CR) 10Mobrovac: [C: 031] Cassandra TWCS deploy repository [software/cassandra-twcs] - 10https://gerrit.wikimedia.org/r/313825 (https://phabricator.wikimedia.org/T133395) (owner: 10Eevans) [14:11:40] (03PS4) 10Andrew Bogott: Move toollabs node classes to roles. [puppet] - 10https://gerrit.wikimedia.org/r/314180 (https://phabricator.wikimedia.org/T147233) [14:11:42] (03PS1) 10Andrew Bogott: Include memcached in role::deprecated::mediawiki::install [puppet] - 10https://gerrit.wikimedia.org/r/314268 (https://phabricator.wikimedia.org/T147233) [14:11:45] 06Operations, 05Prometheus-metrics-monitoring: Port gdnsd statistics from ganglia to prometheus - https://phabricator.wikimedia.org/T147426#2692643 (10fgiunchedi) [14:11:48] (03PS4) 10Dzahn: RESTBase configuration for olo.wikipedia [puppet] - 10https://gerrit.wikimedia.org/r/312808 (https://phabricator.wikimedia.org/T146612) (owner: 10MarcoAurelio) [14:11:50] !log restarting pybal on lvs1006 - T132457 [14:11:51] T132457: Move wdqs to an LVS service - https://phabricator.wikimedia.org/T132457 [14:11:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:12:30] (03CR) 10Eevans: [C: 032 V: 032] Cassandra TWCS deploy repository [software/cassandra-twcs] - 10https://gerrit.wikimedia.org/r/313825 (https://phabricator.wikimedia.org/T133395) (owner: 10Eevans) [14:13:34] RECOVERY - puppet last run on cp4013 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:13:49] (03CR) 10Dzahn: [C: 032] RESTBase configuration for olo.wikipedia [puppet] - 10https://gerrit.wikimedia.org/r/312808 (https://phabricator.wikimedia.org/T146612) (owner: 10MarcoAurelio) [14:15:00] !log restarting pybal on lvs1009 - T132457 [14:15:02] RECOVERY - puppet last run on cp4018 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:15:05] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:15:31] PROBLEM - Host cr1-ulsfo is DOWN: CRITICAL - Network Unreachable (198.35.26.192) [14:16:26] gehel: yeah good point about 9 and 12, I forget about those because we've never gotten over the hurdles to actually use them :) [14:16:49] bblack: can't hurt to restart them... [14:17:24] yeah [14:17:31] !log restarting pybal on lvs1012 - T132457 [14:17:32] T132457: Move wdqs to an LVS service - https://phabricator.wikimedia.org/T132457 [14:17:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:17:39] !log rebooting baham (ns1.wikimedia.org) [14:17:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:18:53] !log restarting pybal on lvs1003 - T132457 [14:18:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:19:28] PROBLEM - Host cr1-ulsfo IPv6 is DOWN: /bin/ping6 -n -U -w 15 -c 5 2620:0:863:ffff::1 [14:19:42] PROBLEM - Router interfaces on cr1-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 37, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-0/0/1: down - Core: cr1-ulsfo:xe-1/2/0 (Telia, IC-313592, 51ms) {#11372} [10Gbps wave]BR [14:19:52] PROBLEM - Router interfaces on cr2-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 66, down: 3, dormant: 0, excluded: 0, unused: 0BRxe-0/0/0: down - Core: cr1-ulsfo:xe-0/0/0 [10Gbps DF]BRxe-1/0/0: down - Core: cr1-ulsfo:xe-1/0/0 [10Gbps DF]BRae0: down - BR [14:20:33] 06Operations, 10Traffic, 05Prometheus-metrics-monitoring: Port gdnsd statistics from ganglia to prometheus - https://phabricator.wikimedia.org/T147426#2692702 (10ema) [14:20:54] RECOVERY - puppet last run on bast4001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:21:41] 06Operations, 10Traffic, 13Patch-For-Review, 05Prometheus-metrics-monitoring: Port varnish metrics from ganglia to prometheus - 
https://phabricator.wikimedia.org/T147424#2692709 (10fgiunchedi) [14:22:27] !log restarting pybal on lvs1006 - T132457 [14:23:54] 06Operations, 10Traffic, 05Prometheus-metrics-monitoring: Port vhtcpd statistics from ganglia to prometheus - https://phabricator.wikimedia.org/T147429#2692715 (10fgiunchedi) [14:24:28] 06Operations, 10Monitoring: Monitor hardware thermal issues - https://phabricator.wikimedia.org/T125205#2692729 (10faidon) [14:24:37] RECOVERY - IPv4 ping to ulsfo on ripe-atlas-ulsfo is OK: OK - failed 0 probes of 420 (alerts on 19) - https://atlas.ripe.net/measurements/1791307/#!map [14:25:01] RECOVERY - Router interfaces on cr2-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 78, down: 0, dormant: 0, excluded: 0, unused: 0 [14:25:23] !log db1055 replacing disk slot 0 [14:25:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:25:32] RECOVERY - Host cr1-ulsfo is UP: PING OK - Packet loss = 0%, RTA = 83.78 ms [14:25:45] 06Operations, 10ops-ulsfo: cr1-ulsfo broken serial cable (or port) - https://phabricator.wikimedia.org/T147430#2692731 (10faidon) [14:26:20] RECOVERY - Host cr1-ulsfo IPv6 is UP: PING OK - Packet loss = 0%, RTA = 81.34 ms [14:26:30] 06Operations, 10netops: Upgrade cr1-ulsfo & cr2-ulsfo to JunOS 13.3 - https://phabricator.wikimedia.org/T143914#2692745 (10faidon) 05Open>03Resolved a:03faidon Done! [14:27:08] let's wait 10-15 minutes for BGP to converge [14:27:10] then I'll repool [14:27:32] RECOVERY - Router interfaces on cr1-eqord is OK: OK: host 208.80.154.198, interfaces up: 39, down: 0, dormant: 0, excluded: 0, unused: 0 [14:27:40] !log restarting pybal on lvs1003 - T132457 [14:27:41] T132457: Move wdqs to an LVS service - https://phabricator.wikimedia.org/T132457 [14:28:21] PROBLEM - Host misc-web-lb.ulsfo.wikimedia.org_ipv6 is DOWN: /bin/ping6 -n -U -w 15 -c 5 2620:0:863:ed1a::3:d [14:28:31] marostegu hi, im wondering if you could help me do stopwords in the innodb table for phabricator please? 
[14:28:34] ignore the page [14:28:38] ok [14:28:42] <_joe_> ok [14:28:43] ulsfo is drained [14:28:48] ah ha [14:28:51] paladox: what do you need? [14:29:17] Im wondering how do i create a table and insert our stopwords we had for myisam before switching to innodb [14:29:20] in puppet [14:29:23] PROBLEM - Host maps-lb.ulsfo.wikimedia.org is DOWN: CRITICAL - Network Unreachable (198.35.26.113) [14:29:29] hm [14:29:44] RECOVERY - puppet last run on cp4010 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [14:30:06] Just mysql doint support custom stopword files any more for innodb [14:30:12] (03PS2) 10Elukey: Refactor memcached role to allow a more flexible hieradata config [puppet] - 10https://gerrit.wikimedia.org/r/314260 (https://phabricator.wikimedia.org/T129963) [14:30:16] so it is a little harder for doing it with table [14:30:30] marostegu ^^ [14:30:44] PROBLEM - Host misc-web-lb.ulsfo.wikimedia.org is DOWN: CRITICAL - Network Unreachable (198.35.26.120) [14:31:08] mutante helped find this https://dev.mysql.com/doc/refman/5.6/en/fulltext-stopwords.html for me [14:31:15] (03CR) 10Elukey: [C: 04-1] "Still didn't take into account Giuseppe's comment about the extended_options hiera lookup, will follow up with him." [puppet] - 10https://gerrit.wikimedia.org/r/314260 (https://phabricator.wikimedia.org/T129963) (owner: 10Elukey) [14:31:20] paladox: I haven't created any tables in Puppet yet. However, I was talking to jynus earlier, and he told me he is going to be doing work on phabricator later so maybe worth waiting for him? [14:31:25] RECOVERY - Host misc-web-lb.ulsfo.wikimedia.org_ipv6 is UP: PING OK - Packet loss = 0%, RTA = 78.97 ms [14:31:28] Ok [14:31:29] thanks [14:31:44] I will wait for him, :), would you know when he will be back on? 
[14:32:04] paladox: Getting the list of words is easy indeed, but I am not sure about how to integrate all that with our puppet repo yet [14:32:21] Yep me too :) [14:32:52] volans: Might know [14:32:55] !log gehel@puppetmaster1001 conftool action : set/pooled=yes; selector: dc=eqiad,cluster=wdqs,service=wdqs [14:33:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:33:23] !log gehel@puppetmaster1001 conftool action : set/pooled=yes; selector: dc=codfw,cluster=wdqs,service=wdqs [14:33:24] (03PS1) 10Muehlenhoff: Stop using package=>latest for standard packages [puppet] - 10https://gerrit.wikimedia.org/r/314270 (https://phabricator.wikimedia.org/T115348) [14:33:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:33:46] Oh [14:33:47] RECOVERY - Host misc-web-lb.ulsfo.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 78.95 ms [14:34:14] marostegui https://github.com/wikimedia/operations-puppet/commit/fd19301dedcfcea04788a4e3fdb615e861ae15de [14:34:18] 06Operations, 10ops-eqiad, 10DBA: db1055: degraded array - https://phabricator.wikimedia.org/T147172#2692766 (10Cmjohnson) swapped the disk (again) [14:34:26] Might help, since i think that is doing sql in puppet [14:34:43] RECOVERY - Host maps-lb.ulsfo.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 79.65 ms [14:35:17] (03PS3) 10Gehel: wdqs - add icinga check for LVS services [puppet] - 10https://gerrit.wikimedia.org/r/312224 (https://phabricator.wikimedia.org/T132457) [14:35:42] (03PS2) 10Madhuvishy: labstore: Add monitoring for secondary HA cluster health [puppet] - 10https://gerrit.wikimedia.org/r/311723 (https://phabricator.wikimedia.org/T144633) [14:36:37] (03CR) 10Gehel: [C: 032] wdqs - add icinga check for LVS services [puppet] - 10https://gerrit.wikimedia.org/r/312224 (https://phabricator.wikimedia.org/T132457) (owner: 10Gehel) [14:36:44] (03CR) 10jenkins-bot: [V: 04-1] labstore: Add monitoring for secondary HA cluster 
health [puppet] - 10https://gerrit.wikimedia.org/r/311723 (https://phabricator.wikimedia.org/T144633) (owner: 10Madhuvishy) [14:38:32] (03PS2) 10Andrew Bogott: Include memcached in role::deprecated::mediawiki::install [puppet] - 10https://gerrit.wikimedia.org/r/314268 (https://phabricator.wikimedia.org/T147233) [14:39:57] (03PS3) 10Madhuvishy: labstore: Add monitoring for secondary HA cluster health [puppet] - 10https://gerrit.wikimedia.org/r/311723 (https://phabricator.wikimedia.org/T144633) [14:40:13] PROBLEM - check_puppetrun on backup4001 is CRITICAL: CRITICAL: puppet fail [14:40:32] (03CR) 10Faidon Liambotis: [C: 04-1] install_server: use http/lpxelinux on install hosts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/314251 (owner: 10Filippo Giunchedi) [14:40:44] (03CR) 10Andrew Bogott: [C: 032] Include memcached in role::deprecated::mediawiki::install [puppet] - 10https://gerrit.wikimedia.org/r/314268 (https://phabricator.wikimedia.org/T147233) (owner: 10Andrew Bogott) [14:41:41] (03PS1) 10Faidon Liambotis: Revert "Drain ulsfo for network maintenance" [dns] - 10https://gerrit.wikimedia.org/r/314272 [14:44:01] (03PS2) 10Faidon Liambotis: Revert "Drain ulsfo for network maintenance" [dns] - 10https://gerrit.wikimedia.org/r/314272 [14:45:16] PROBLEM - check_puppetrun on backup4001 is CRITICAL: CRITICAL: puppet fail [14:46:20] (03CR) 10Faidon Liambotis: [C: 032] Revert "Drain ulsfo for network maintenance" [dns] - 10https://gerrit.wikimedia.org/r/314272 (owner: 10Faidon Liambotis) [14:48:06] 06Operations, 10ops-eqiad, 10DBA: db1055: degraded array - https://phabricator.wikimedia.org/T147172#2692931 (10Marostegui) Thanks - I will check tomorrow if it built successfully [14:49:17] hm… this change seems to be including the memcached role rather than the memcached class: https://gerrit.wikimedia.org/r/#/c/314268/2/modules/role/manifests/deprecated/mediawiki/install.pp [14:49:33] (03PS5) 10Eevans: Extend classpath via Puppet [puppet] - 
10https://gerrit.wikimedia.org/r/313619 (https://phabricator.wikimedia.org/T133395) [14:49:34] Not what I expected! Is there a way to specify? [14:50:07] (03PS5) 10Eevans: Enable cassandra/twcs deploy repository [puppet] - 10https://gerrit.wikimedia.org/r/313892 (https://phabricator.wikimedia.org/T133395) [14:50:16] PROBLEM - check_puppetrun on backup4001 is CRITICAL: CRITICAL: puppet fail [14:50:38] 06Operations, 06Discovery, 06Maps, 03Interactive-Sprint: reimage maps-test* servers - https://phabricator.wikimedia.org/T147194#2692982 (10ops-monitoring-bot) Script wmf_auto_reimage was launched by gehel on neodymium.eqiad.wmnet for hosts: ``` ['maps-test2002.codfw.wmnet'] ``` The log can be found in `/va... [14:51:52] 06Operations, 10Deployment-Systems: Have fallback communication channel when freenode has problems - https://phabricator.wikimedia.org/T127904#2057999 (10Luke081515) What we can do too: I setup a complete indipendent ircd at the past (for ircd related tests) at labs. Currently it allows at least 1000 users (I... [14:51:53] (03PS2) 10Filippo Giunchedi: install_server: use http/lpxelinux on install hosts [puppet] - 10https://gerrit.wikimedia.org/r/314251 [14:52:44] godog: did you test this with the bast3001 reinstall? [14:53:07] (03CR) 10Faidon Liambotis: "LGTM, if it has been tested and works :) I wonder if we should do it for all hosts rather than just the install servers." [puppet] - 10https://gerrit.wikimedia.org/r/314251 (owner: 10Filippo Giunchedi) [14:53:57] paravoid: I did! 
It worked after I figured just using a different prefix is enough [14:55:06] PROBLEM - check_puppetrun on backup4001 is CRITICAL: CRITICAL: puppet fail [14:57:56] RECOVERY - puppet last run on cp4016 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:00:07] PROBLEM - check_puppetrun on backup4001 is CRITICAL: CRITICAL: puppet fail [15:00:12] 06Operations: Monitor failing ferm restarts / availability of ferm service - https://phabricator.wikimedia.org/T108303#2693003 (10MoritzMuehlenhoff) [15:00:53] (03CR) 10Filippo Giunchedi: [C: 032] "+1 on using it on all hosts, we'd probably want to use a different vhost to keep apt and tftp separate." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/314251 (owner: 10Filippo Giunchedi) [15:01:00] (03PS3) 10Filippo Giunchedi: install_server: use http/lpxelinux on install hosts [puppet] - 10https://gerrit.wikimedia.org/r/314251 [15:04:02] !log add lpxelinux.0 to volatile/tftpboot on puppet.eqiad.wmnet [15:04:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:04:29] (03PS1) 10Faidon Liambotis: Drain esams for network maintenance [dns] - 10https://gerrit.wikimedia.org/r/314275 [15:04:42] godog: it might be also included on the syslinux package btw [15:04:49] PROBLEM - puppet last run on wmf4748 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [15:04:59] godog: but that would vary between different distro versions, so it might be smarter to just push it from volatile indeed [15:05:16] RECOVERY - check_puppetrun on backup4001 is OK: OK: Puppet is currently enabled, last run 214 seconds ago with 0 failures [15:05:23] (03CR) 10Faidon Liambotis: [C: 032] Drain esams for network maintenance [dns] - 10https://gerrit.wikimedia.org/r/314275 (owner: 10Faidon Liambotis) [15:05:47] PROBLEM - NTP on eeden is CRITICAL: NTP CRITICAL: Offset unknown [15:05:52] paravoid: yeah I was reading today about that and the ldlinux.c32 split syslinux did, that's the jessie version [15:07:01] (03Abandoned) 10Dzahn: install: add network location to server MOTDs [puppet] - 10https://gerrit.wikimedia.org/r/313375 (https://phabricator.wikimedia.org/T84518) (owner: 10Dzahn) [15:10:53] 06Operations, 10ops-eqiad: Add new disks to syslog server in eqiad (lithium) - https://phabricator.wikimedia.org/T143307#2693015 (10fgiunchedi) 05stalled>03Open a:05fgiunchedi>03Cmjohnson @Cmjohnson we can go ahead with swapping the disks and reimage now. wezen.codfw.wmnet has a month worth of logs for... [15:12:18] (03PS5) 10Andrew Bogott: Move toollabs node classes to roles. [puppet] - 10https://gerrit.wikimedia.org/r/314180 (https://phabricator.wikimedia.org/T147233) [15:12:20] (03PS1) 10Andrew Bogott: include ::memcached instead of memcached [puppet] - 10https://gerrit.wikimedia.org/r/314276 [15:13:32] @godog: i can swap disks now [15:14:00] (03CR) 10Andrew Bogott: [C: 032] include ::memcached instead of memcached [puppet] - 10https://gerrit.wikimedia.org/r/314276 (owner: 10Andrew Bogott) [15:14:52] cmjohnson1: nice! 
ok I've downtimed lithium in icinga [15:14:55] 06Operations, 10Prod-Kubernetes, 05Kubernetes-production-experiment: setup wmf4747/wmf4748/wmf4749/wmf4750 for temp kubernetes testing - https://phabricator.wikimedia.org/T146171#2693025 (10RobH) They should be going away this month (likely within the next couple of weeks), since the test hosts ordered for t... [15:15:01] great....thx [15:17:10] PROBLEM - HP RAID on ms-be1023 is CRITICAL: CHECK_NRPE: Socket timeout after 40 seconds. [15:17:27] PROBLEM - HP RAID on ms-be1026 is CRITICAL: CHECK_NRPE: Socket timeout after 40 seconds. [15:18:49] (03PS1) 10Filippo Giunchedi: install_server: reinstall lithium with jessie and gpt [puppet] - 10https://gerrit.wikimedia.org/r/314280 (https://phabricator.wikimedia.org/T143307) [15:18:55] !log upgrading JunOS on cr1-esams [15:19:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:21:18] (03PS2) 10Filippo Giunchedi: install_server: reinstall lithium with jessie and gpt [puppet] - 10https://gerrit.wikimedia.org/r/314280 (https://phabricator.wikimedia.org/T143307) [15:22:30] PROBLEM - HP RAID on ms-be1024 is CRITICAL: CHECK_NRPE: Socket timeout after 40 seconds. 
[15:22:31] (03CR) 10Filippo Giunchedi: [C: 032] install_server: reinstall lithium with jessie and gpt [puppet] - 10https://gerrit.wikimedia.org/r/314280 (https://phabricator.wikimedia.org/T143307) (owner: 10Filippo Giunchedi) [15:24:50] 06Operations, 06Discovery, 06Maps, 03Interactive-Sprint: reimage maps-test* servers - https://phabricator.wikimedia.org/T147194#2693065 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['maps-test2002.codfw.wmnet'] ``` Those hosts were successful: ``` ['maps-test2002.codfw.wmnet'] ``` [15:27:16] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [15:27:27] RECOVERY - puppet last run on wmf4748 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures [15:28:01] wmf4748 ? [15:28:04] ??? [15:29:07] akosiaris: https://phabricator.wikimedia.org/T146171#2693025 [15:29:07] akosiaris: https://phabricator.wikimedia.org/T146171 [15:31:38] 06Operations, 10ops-eqiad, 13Patch-For-Review: Add new disks to syslog server in eqiad (lithium) - https://phabricator.wikimedia.org/T143307#2693127 (10Cmjohnson) @fgiunchedi The disks have been swapped and you're free to reinstall. [15:32:06] ok, thanks [15:33:16] RECOVERY - HP RAID on ms-be1026 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller, Battery/Capacitor [15:34:57] RECOVERY - Apache HTTP on mw1274 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 612 bytes in 0.039 second response time [15:35:09] PROBLEM - Host cr1-esams is DOWN: CRITICAL - Network Unreachable (91.198.174.245) [15:35:16] !log reimage lithium with bigger disks T143307 [15:35:17] T143307: Add new disks to syslog server in eqiad (lithium) - https://phabricator.wikimedia.org/T143307 [15:35:18] cmjohnson1: thanks! 
[15:35:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:38:06] !log restarted hhvm on mw1274, was stuck [15:38:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:38:18] RECOVERY - HP RAID on ms-be1024 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller, Battery/Capacitor [15:38:40] (03PS1) 10Alexandros Kosiaris: icinga: Parameterize icinga class to not notify [puppet] - 10https://gerrit.wikimedia.org/r/314282 [15:39:44] !log T146211: Restarting Cassandra on restbase1007-a.eqiad.wmnet to mark parsoid.data-parsoid tables unrepaired [15:39:45] T146211: Cluster-wide major compactions: parsoid.data-parsoid table - https://phabricator.wikimedia.org/T146211 [15:39:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:40:47] PROBLEM - Host cr1-esams IPv6 is DOWN: /bin/ping6 -n -U -w 15 -c 5 2620:0:862:ffff::5 [15:41:18] PROBLEM - HP RAID on ms-be1026 is CRITICAL: CHECK_NRPE: Socket timeout after 40 seconds. 
[15:44:40] RECOVERY - Host cr1-esams is UP: PING OK - Packet loss = 0%, RTA = 87.42 ms [15:44:55] !log upgrading JunOS on cr2-knams [15:45:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:46:49] RECOVERY - Host cr1-esams IPv6 is UP: PING OK - Packet loss = 0%, RTA = 84.74 ms [15:48:05] !log T146211: Restarting Cassandra on restbase1007-b.eqiad.wmnet to mark parsoid.data-parsoid tables unrepaired [15:48:07] T146211: Cluster-wide major compactions: parsoid.data-parsoid table - https://phabricator.wikimedia.org/T146211 [15:48:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:51:26] (03PS3) 10Elukey: Refactor memcached role to allow a more flexible hieradata config [puppet] - 10https://gerrit.wikimedia.org/r/314260 (https://phabricator.wikimedia.org/T129963) [15:51:49] RECOVERY - HP RAID on ms-be1026 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller, Battery/Capacitor [15:52:33] RECOVERY - HP RAID on ms-be1023 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller, Battery/Capacitor [15:54:15] !log T146211: Restarting Cassandra on restbase1007-c.eqiad.wmnet to mark parsoid.data-parsoid tables unrepaired [15:54:16] T146211: Cluster-wide major compactions: parsoid.data-parsoid table - https://phabricator.wikimedia.org/T146211 [15:54:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:58:28] cmjohnson1: sigh, I think lithium might need a poke/reseat, it is stuck at "Scanning for devices. Please wait, this may take several minutes..." since 10/15 min now [15:58:39] I've left the console [15:59:02] elvis has left the console! 
[15:59:54] * godog pictures fat elvis DJs [16:00:01] DJing even [16:01:17] okay...i will look in a bit [16:03:33] PROBLEM - HP RAID on ms-be1023 is CRITICAL: CHECK_NRPE: Socket timeout after 40 seconds. [16:03:48] PROBLEM - Host cr2-knams is DOWN: CRITICAL - Network Unreachable (91.198.174.246) [16:04:27] (03PS1) 10Andrew Bogott: Add role::labs::bootstrapvz [puppet] - 10https://gerrit.wikimedia.org/r/314285 (https://phabricator.wikimedia.org/T147233) [16:08:01] (03PS1) 10Paladox: Create a phabricator_stopwords phabricator table in sql (innodb) [puppet] - 10https://gerrit.wikimedia.org/r/314286 (https://phabricator.wikimedia.org/T146673) [16:08:09] PROBLEM - Host cr2-knams IPv6 is DOWN: /bin/ping6 -n -U -w 15 -c 5 2620:0:862:ffff::4 [16:08:16] marostegu Hi, i think thats ^^ how to do it? [16:08:25] But im unsure what the phabricator db is called [16:08:56] 06Operations, 10Ops-Access-Requests: root access on security-tools instances for Darian Patrick - https://phabricator.wikimedia.org/T138873#2693426 (10RobH) 05stalled>03declined Please note that this task has sat open pending feedback from @dpatrick since August 2nd. As such, I'm closing this as declined.... [16:08:57] which is needed for variable innodb_ft_server_stopword_table [16:09:30] PROBLEM - HHVM rendering on mw1209 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50404 bytes in 0.014 second response time [16:09:35] 06Operations, 10MediaWiki-API, 10Monitoring, 06Services, 10Traffic: Set up action API latency / error rate metrics & alerts - https://phabricator.wikimedia.org/T123854#2693434 (10GWicke) > I think anything we have today is going to just create more icinga spam in the IRC channel? While I agree that we n... 
[16:10:04] (03PS2) 10Paladox: Create a phabricator_stopwords phabricator table in sql (innodb) [puppet] - 10https://gerrit.wikimedia.org/r/314286 (https://phabricator.wikimedia.org/T146673) [16:11:25] paladox: db1043.eqiad.wmnet afaik [16:11:37] thanks [16:11:42] so can i do [16:11:51] db1043/phabricator_stopwords? [16:11:52] paladox: db2012 in codfw [16:11:58] i dont know that part [16:12:02] Oh [16:12:09] RECOVERY - HHVM rendering on mw1209 is OK: HTTP OK: HTTP/1.1 200 OK - 70734 bytes in 0.113 second response time [16:12:26] paladox: i just see that in site.pp, the nodes that have role::mariadb::misc::phabricator [16:12:33] Needs to be set like db_name/table_name [16:12:39] Oh [16:13:34] and cr2-knams is back online [16:13:43] I'll give it another 15 minutes to converge again and repool [16:14:08] RECOVERY - HP RAID on ms-be1023 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller, Battery/Capacitor [16:14:09] RECOVERY - Host cr2-knams is UP: PING OK - Packet loss = 0%, RTA = 85.71 ms [16:14:12] 06Operations, 10netops: Upgrade cr1-esams & cr2-knams to JunOS 13.3 - https://phabricator.wikimedia.org/T143913#2693493 (10faidon) 05Open>03Resolved a:03faidon …aand done as well. 
[16:14:58] RECOVERY - Host cr2-knams IPv6 is UP: PING OK - Packet loss = 0%, RTA = 88.17 ms [16:16:06] (03CR) 10Andrew Bogott: [C: 032] Add role::labs::bootstrapvz [puppet] - 10https://gerrit.wikimedia.org/r/314285 (https://phabricator.wikimedia.org/T147233) (owner: 10Andrew Bogott) [16:21:46] (03CR) 10Elukey: "https://puppet-compiler.wmflabs.org/4216/ - noop as expected" [puppet] - 10https://gerrit.wikimedia.org/r/314260 (https://phabricator.wikimedia.org/T129963) (owner: 10Elukey) [16:22:05] 06Operations, 10Traffic, 10netops: Fix static IP fallbacks to Pybal LVS routes - https://phabricator.wikimedia.org/T143915#2693648 (10BBlack) Bump [16:22:08] PROBLEM - HP RAID on ms-be1023 is CRITICAL: CHECK_NRPE: Socket timeout after 40 seconds. [16:24:35] 06Operations, 06Analytics-Kanban, 06Performance-Team, 10Traffic: Preliminary Design document for A/B testing - https://phabricator.wikimedia.org/T143694#2693678 (10Nuria) Working on doc https://docs.google.com/document/d/1jRGjVAthJXoCovxyvXWyg07R1POb8zvD_n8IlJXrPVM/edit# Will start addressing @ellery's la... 
[16:28:03] 06Operations, 05Prometheus-metrics-monitoring: Upgrade mysqld_exporter to 0.9.0 - https://phabricator.wikimedia.org/T147476#2693704 (10fgiunchedi) [16:28:08] (03PS1) 10Legoktm: Don't grant editcontentmodel to all users yet [mediawiki-config] - 10https://gerrit.wikimedia.org/r/314290 [16:28:39] !log upgrade mysqld_exporter to 0.9.0 on db2030 T147476 [16:28:40] T147476: Upgrade mysqld_exporter to 0.9.0 - https://phabricator.wikimedia.org/T147476 [16:28:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:29:09] jouncebot: next [16:29:09] In 1 hour(s) and 30 minute(s): Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161005T1800) [16:29:21] (03CR) 10Legoktm: [C: 032] Don't grant editcontentmodel to all users yet [mediawiki-config] - 10https://gerrit.wikimedia.org/r/314290 (owner: 10Legoktm) [16:29:42] 06Operations, 10scap, 03Scap3 (Scap3-MediaWiki-MVP): Move scap target configuration to etcd - https://phabricator.wikimedia.org/T115899#2693727 (10thcipriani) p:05Normal>03Low [16:29:48] (03Merged) 10jenkins-bot: Don't grant editcontentmodel to all users yet [mediawiki-config] - 10https://gerrit.wikimedia.org/r/314290 (owner: 10Legoktm) [16:30:08] RECOVERY - HP RAID on ms-be1023 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller, Battery/Capacitor [16:33:38] (03PS1) 10Faidon Liambotis: Revert "Drain esams for network maintenance" [dns] - 10https://gerrit.wikimedia.org/r/314292 [16:33:51] (03PS31) 10Rush: Add python version of maintain-replicas script [software] - 10https://gerrit.wikimedia.org/r/295607 (https://phabricator.wikimedia.org/T138450) (owner: 10Alex Monk) [16:33:59] (03CR) 10Faidon Liambotis: [C: 032] Revert "Drain esams for network maintenance" [dns] - 10https://gerrit.wikimedia.org/r/314292 (owner: 10Faidon Liambotis) [16:34:29] 06Operations, 
05Prometheus-metrics-monitoring: Upgrade mysqld_exporter to 0.9.0 - https://phabricator.wikimedia.org/T147476#2693735 (10fgiunchedi) For the collectors we have enabled, the difference is added dimensions for replication metrics: (channel_name / master_host / master_uuid) ```lines=4 @@ -934,27 +9... [16:35:53] !log legoktm@tin Synchronized wmf-config/InitialiseSettings.php: Don't grant editcontentmodel to all users yet (duration: 01m 01s) [16:35:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:41:45] (03PS2) 10Alexandros Kosiaris: icinga: Parameterize icinga class to not notify [puppet] - 10https://gerrit.wikimedia.org/r/314282 [16:41:49] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] icinga: Parameterize icinga class to not notify [puppet] - 10https://gerrit.wikimedia.org/r/314282 (owner: 10Alexandros Kosiaris) [16:43:54] (03PS1) 10Paladox: Replace broken rss links with the correct rss links [puppet] - 10https://gerrit.wikimedia.org/r/314293 (https://phabricator.wikimedia.org/T134437) [16:44:06] (03PS1) 10Dzahn: planet: remove some broken feeds with "no data" [puppet] - 10https://gerrit.wikimedia.org/r/314294 (https://phabricator.wikimedia.org/T134437) [16:45:14] (03PS1) 10Giuseppe Lavagetto: scap_source: use one provider, pass "origin" as a parameter [puppet] - 10https://gerrit.wikimedia.org/r/314295 [16:45:16] (03PS1) 10Giuseppe Lavagetto: scap_source: enforce the origin url [puppet] - 10https://gerrit.wikimedia.org/r/314296 (https://phabricator.wikimedia.org/T143692) [16:47:18] (03PS2) 10Paladox: planet: fix broken blogspot RSS links by label [puppet] - 10https://gerrit.wikimedia.org/r/314293 (https://phabricator.wikimedia.org/T134437) [16:47:22] 07Puppet, 06Labs, 10Phabricator: Phabricator labs puppet role configures phabricator wrong - https://phabricator.wikimedia.org/T131899#2693827 (10demon) a:05demon>03None [16:48:21] (03CR) 10Dzahn: [C: 032] "thanks for the help" [puppet] - 10https://gerrit.wikimedia.org/r/314293 
(https://phabricator.wikimedia.org/T134437) (owner: 10Paladox) [16:48:34] thanks and your welcome ^^ [16:48:35] :) [16:50:01] (03CR) 10jenkins-bot: [V: 04-1] scap_source: enforce the origin url [puppet] - 10https://gerrit.wikimedia.org/r/314296 (https://phabricator.wikimedia.org/T143692) (owner: 10Giuseppe Lavagetto) [16:51:05] (03PS3) 10Dzahn: planet: fix broken blogspot RSS links by label [puppet] - 10https://gerrit.wikimedia.org/r/314293 (https://phabricator.wikimedia.org/T134437) (owner: 10Paladox) [16:53:34] 06Operations, 10Traffic, 15User-Joe, 07discovery-system: Upgrade conftool to 0.3.1 - https://phabricator.wikimedia.org/T147480#2693856 (10Joe) [16:53:45] 06Operations, 10Traffic, 15User-Joe, 07discovery-system: Upgrade conftool to 0.3.1 - https://phabricator.wikimedia.org/T147480#2693868 (10Joe) p:05Triage>03High [16:59:34] 07Blocked-on-Operations, 10DBA, 06Labs, 10Labs-Infrastructure, 13Patch-For-Review: maintain-replicas.pl unmaintained, unmaintainable - https://phabricator.wikimedia.org/T138450#2693879 (10chasemp) I feel comfortable that https://gerrit.wikimedia.org/r/#/c/295607/ is a replication of https://github.com/wi... 
[17:00:32] (03PS1) 10Andrew Bogott: Added role::wikidata::builder [puppet] - 10https://gerrit.wikimedia.org/r/314301 (https://phabricator.wikimedia.org/T147233) [17:02:03] (03CR) 10Andrew Bogott: [C: 032] Added role::wikidata::builder [puppet] - 10https://gerrit.wikimedia.org/r/314301 (https://phabricator.wikimedia.org/T147233) (owner: 10Andrew Bogott) [17:02:08] (03PS1) 10Alexandros Kosiaris: tegmen: Assign icinga classes [puppet] - 10https://gerrit.wikimedia.org/r/314303 [17:02:30] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 3068091 keys - replication_delay is 0 [17:04:44] sure Reedy [17:05:21] (03CR) 10Rush: [C: 032] Add python version of maintain-replicas script [software] - 10https://gerrit.wikimedia.org/r/295607 (https://phabricator.wikimedia.org/T138450) (owner: 10Alex Monk) [17:08:16] (03PS1) 10Rush: labsdb: maintain-replicas.pl removal [software] - 10https://gerrit.wikimedia.org/r/314304 [17:09:31] chasemp: https://i.imgur.com/7drHiqr.gif [17:10:37] (03PS2) 10Alexandros Kosiaris: tegmen: Assign icinga classes [puppet] - 10https://gerrit.wikimedia.org/r/314303 [17:10:41] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] tegmen: Assign icinga classes [puppet] - 10https://gerrit.wikimedia.org/r/314303 (owner: 10Alexandros Kosiaris) [17:11:29] (03CR) 10Rush: [C: 032] labsdb: maintain-replicas.pl removal [software] - 10https://gerrit.wikimedia.org/r/314304 (owner: 10Rush) [17:13:46] chasemp, wanna make the patch to remove the problematic view from the script? [17:14:02] Krenair: sure [17:14:11] I mean, I could [17:14:19] but if you're already in there [17:15:33] (03PS1) 10Rush: maintain-replicas: remove abuse_filter_history view [software] - 10https://gerrit.wikimedia.org/r/314305 [17:16:06] can someone look if rdb1007.eqiad.wmnet is working ok ? 
there are many connection errors to it [17:16:08] (03PS2) 10Dzahn: planet: remove some broken feeds with "no data" [puppet] - 10https://gerrit.wikimedia.org/r/314294 (https://phabricator.wikimedia.org/T134437) [17:16:13] (03PS3) 10Dzahn: planet: remove some broken feeds with "no data" [puppet] - 10https://gerrit.wikimedia.org/r/314294 (https://phabricator.wikimedia.org/T134437) [17:18:22] (03CR) 10Dzahn: [C: 032] planet: remove some broken feeds with "no data" [puppet] - 10https://gerrit.wikimedia.org/r/314294 (https://phabricator.wikimedia.org/T134437) (owner: 10Dzahn) [17:19:32] matanya, well I can ping it [17:19:53] it is role jobqueue_redis [17:20:00] nc on 6379 works ? [17:21:06] I can use redis-cli against it [17:21:18] (03CR) 10Rush: [C: 032] maintain-replicas: remove abuse_filter_history view [software] - 10https://gerrit.wikimedia.org/r/314305 (owner: 10Rush) [17:21:19] so yes that port works [17:21:23] (03PS2) 10Muehlenhoff: Update to 4.4.23 [debs/linux44] - 10https://gerrit.wikimedia.org/r/314002 [17:21:27] thanks Krenair [17:23:10] what sort of errors are you seeing and where matanya? [17:23:41] (03PS5) 10Ema: varnish: add varnishstat dstat plugin [puppet] - 10https://gerrit.wikimedia.org/r/314247 [17:23:52] !log installing libav security updates [17:23:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:24:01] Krenair: mw1256 Warning: timed out after 0.3 seconds when connecting to rdb1007.eqiad.wmnet [110]: Connection timed out [17:24:10] is it still occurring? [17:24:31] yes [17:25:34] I'm getting prod 503s. [17:25:51] what URL James_F? [17:25:54] Hmm, but not logged out. [17:25:55] https://www.mediawiki.org/w/index.php?title=Phabricator&diff=0&oldid=2241588 [17:26:10] I can view that while logged in [17:26:21] "Request from 198.73.209.2 via cp1055 cp1055, Varnish XID 3049288457 Error: 503, Service Unavailable at Wed, 05 Oct 2016 17:25:41 GMT" [17:26:50] Krinkle, any idea what's going on there? 
^ [17:28:13] works for me not-logged-in [17:28:23] also it seems someone in OIT forgot to set up a reverse DNS entry for that IP [17:28:53] Yeah, not-logged-in works fine. [17:28:59] Maybe I've got an odd cookie? [17:29:14] well... [17:29:17] Other URLs on MW.org work for me logged-in. [17:29:20] for some value of "not logged in" [17:30:13] the top of the page in the browser says "not logged in", but I apparently still have a mediawikiwikiSession= cookie :P [17:30:21] Ho hum. [17:30:22] so it's passing all my traffic through cache pointlessly [17:30:26] Actually, it seems the corp DNS servers are refusing connections, at least from outside [17:31:21] shouldn't we be actively deleting session/token cookies if they refer to an invalid/expired session? [17:31:59] !log T146211: Performing rolling restart of restbase1010.eqiad.wmnet Cassandra instances, and marking SSTables unrepaired. [17:32:00] T146211: Cluster-wide major compactions: parsoid.data-parsoid table - https://phabricator.wikimedia.org/T146211 [17:32:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:32:39] Interestingly I also can't load https://www.mediawiki.org/wiki/Phabricator whilst logged-in. [17:33:43] I can load it while logged into either Krenair or Alex Monk (WMF) [17:33:43] I can view it, while logged-in [17:33:57] You just get a 503 there too James_F? [17:34:00] None of my cookies look particularly odd. [17:34:11] Krenair: Yeah. "Request from 198.73.209.2 via cp1053 cp1053, Varnish XID 4216756576 Error: 503, Service Unavailable at Wed, 05 Oct 2016 17:32:56 GMT" [17:34:21] bblack, any chance you can dig up the varnishlog for that? 
[17:34:30] possibly [17:34:48] well not past logs, but possible I can cook up a way to catch it when he tries again [17:35:28] !log installing chromium security updates on osmium [17:35:33] I suggest we try it, we don't know how widespread this problem is or really what's going on when it happens [17:35:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:36:03] we do seem to have some small spikes of new 503 issues on the text cluster the past couple of days... [17:36:10] (looking at metrics) [17:36:14] bblack: Happy to reload when you are ready. [17:37:18] James_F: do you get an X-Cache response header? [17:37:46] Yes. "x-cache: cp1053 miss, cp2010 miss, cp4009 miss, cp4018 miss" [17:38:57] is that for https://www.mediawiki.org/wiki/Phabricator ? [17:38:59] Yes. [17:39:05] odd.... [17:39:22] not the diff page, right? [17:39:33] Yeah. [17:39:48] try again? [17:39:48] Request URL:https://www.mediawiki.org/wiki/Phabricator Request Method:GET [17:39:57] Same again. [17:40:02] Varnish XID 2067008602 [17:40:08] x-cache:cp1067 miss, cp2010 pass, cp4009 miss, cp4018 miss [17:40:22] yeah, but different cache path [17:40:32] Yeah. [17:41:41] I get a white screen on the page when i click edit and save [17:41:45] Internet Explorer [17:41:49] James_F: one more time? [17:42:15] Varnish XID 372836574 / x-cache:cp1068 miss, cp2010 pass, cp4009 miss, cp4018 miss [17:44:13] James_F: I can reproduce it from "curl" on the commandline (through whatever cache path), if I copy the same Cookie header your browser is sending... [17:44:26] let me pare that down a bit and see which is causing it [17:44:33] * James_F nods. [17:45:18] PROBLEM - HP RAID on ms-be1026 is CRITICAL: CHECK_NRPE: Socket timeout after 40 seconds. [17:45:45] paladox, apparently it returns HTTP 200 and Content-Length: 20, but no response body? 
[17:46:02] Yep [17:46:17] Well i get no http 200, but i guess thats in the f12 tools [17:47:47] RECOVERY - HP RAID on ms-be1026 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller, Battery/Capacitor [17:49:06] PROBLEM - Check correctness of the icinga configuration on neon is CRITICAL: Icinga configuration contains errors [17:50:42] bblack: having a valid session (cookie) and being logged out are not mutually exclusive. Visiting sign up, log in or edit pages will and should create a session from that moment forward for 30+ days. That's always been that way [17:52:32] ok, right [17:53:04] not that that's necessarily what we should be doing, but .... :P [17:53:15] there's a lot of dragons to chase in that area, with the various forms of session cookie [17:53:40] Yeah. I won't deny that it's not useful for most cases, but it's "normal" for now [17:53:52] James_F: I can repro with only 3x of your cookies set from curl: centralauth_User centralauth_Session centralauth_Token [17:53:58] something about your CA cookies... [17:54:13] Interesting. [17:54:31] well, probably something down in the MW stack somwhere that doesn't like the state of your CA session, or whatever... [17:54:33] why would varnish send an HTTP 503 though? [17:54:45] bblack: But it doesn't happen with other pages. Some magic confluence of CentralAuth and that page's content? [17:54:45] broken gzip handling on the apache side? 
[17:54:56] we've seen that before, but I'm not sure yet on "why" [17:55:31] Varnish will show this error if apache says it's sending gzipped content but then sends raw content [17:56:16] (03CR) 10EBernhardson: [C: 031] Initialize subphrases autocomplete on wikisources, mw.org and wikitech [mediawiki-config] - 10https://gerrit.wikimedia.org/r/314255 (https://phabricator.wikimedia.org/T146208) (owner: 10DCausse) [17:56:18] we saw that behaviour in https://phabricator.wikimedia.org/T146904 [17:56:25] (03CR) 10EBernhardson: [C: 031] Activate subphrases autocomplete on wikisources, mw.org and wikitech [mediawiki-config] - 10https://gerrit.wikimedia.org/r/314257 (https://phabricator.wikimedia.org/T146208) (owner: 10DCausse) [17:56:37] yeah [17:56:38] 38 FetchError c Junk after gzip data [17:56:44] ^ triggered from my curl with his CA cookies [17:56:53] (on the backend-most cache talking to appservers.svc) [17:57:26] so, his CA cookies are causing some kind of break on the MW side, and rather than returning a clean error it's serving broken gzip output [17:57:52] we've got another older ticket about similar things with hhvm, too [17:58:10] https://phabricator.wikimedia.org/T125938 [17:58:33] !log T146211: Performing rolling restart of restbase1011.eqiad.wmnet Cassandra instances, and marking SSTables unrepaired. [17:58:34] T146211: Cluster-wide major compactions: parsoid.data-parsoid table - https://phabricator.wikimedia.org/T146211 [17:58:38] I had forgotten about that one, despite it being somewhere still in my inbox [17:58:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:59:10] mediawiki.org down ? [17:59:27] err. https://www.mediawiki.org/wiki/How_to_become_a_MediaWiki_hacker [17:59:34] that page shows that its down [17:59:45] PROBLEM - puppet last run on mc1016 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [18:00:04] addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Dear anthropoid, the time has come. Please deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161005T1800). [18:00:15] https://www.mediawiki.org/wiki/How_to_become_a_MediaWiki_hacker is not loading for me either [18:00:25] Request from 86.183.242.164 via cp1055 cp1055, Varnish XID 3060447833 [18:00:25] Error: 503, Service Unavailable at Wed, 05 Oct 2016 17:59:55 GMT [18:00:56] okey. I thought it was just me. [18:00:59] bblack Krenair ^^ [18:01:02] so if I come from inside our infra and contact appservers.svc directly [18:01:07] I can control gzip [18:01:18] (03PS1) 10Andrew Bogott: Added role::labs::sentry [puppet] - 10https://gerrit.wikimedia.org/r/314310 (https://phabricator.wikimedia.org/T147233) [18:01:33] without gzip enabled, I do get a 500 Internal Server Error from appservers.svc [18:01:58] tonythomas, paladox: confirmed [18:02:03] Yep [18:02:17] if I add --compressed to my internal curl repro, then I get: [18:02:25] < HTTP/1.1 200 OK [18:02:33] < Content-Encoding: gzip [18:02:37] (03PS2) 10Andrew Bogott: Added role::labs::sentry [puppet] - 10https://gerrit.wikimedia.org/r/314310 (https://phabricator.wikimedia.org/T147233) [18:02:44] < Transfer-Encoding: chunked [18:02:44] < Content-Type: text/html [18:02:44] < [18:02:45] * Error while processing content unencoding: invalid code lengths set [18:02:47] * Failed writing data [18:02:50] * Closing connection 0 [18:03:24] so there's two layers of issues here: hhvm has horrible gzip-related output bugs that tend to get triggered when PHP errors happen [18:03:46] and there's a PHP error of the 500 variety happening for this request for this particular page with James_F's CA creds [18:06:02] what's also interesting: when fetched internally from appservers.svc without gzip encoding on, the "500 
Internal Server Error" is accompanied by what appears at a glance to be the full correct page content output [18:06:22] oh wait I'm wrong, it's just a very long error page [18:06:48] 52KB of content for a 500 heh [18:07:25] multilingual with sidebar and all of that [18:07:55] ^ that's the actual error content within, it seems [18:07:58] PHP fatal error: [18:08:00] Cannot access empty property
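The failure discussed above (a response labelled Content-Encoding: gzip whose body does not decode cleanly) can be checked for offline. The following is a minimal sketch in Python, assuming the raw response body has already been captured as bytes; the function name `classify_gzip_body` and the category strings are hypothetical, and this is not how Varnish or hhvm implement their own checks.

```python
import gzip
import zlib


def classify_gzip_body(body: bytes) -> str:
    """Classify a response body that was labelled Content-Encoding: gzip."""
    # wbits=31 tells zlib to expect a gzip header and trailer.
    decoder = zlib.decompressobj(wbits=31)
    try:
        decoder.decompress(body)
    except zlib.error:
        # Not a gzip stream at all, or damaged partway through.
        return "corrupt"
    if not decoder.eof:
        # The stream ended before the gzip trailer was seen.
        return "truncated"
    if decoder.unused_data:
        # A complete gzip stream followed by extra bytes.
        return "junk after gzip data"
    return "ok"


if __name__ == "__main__":
    good = gzip.compress(b"<html>page</html>")
    print(classify_gzip_body(good))
    print(classify_gzip_body(good + b"PHP fatal error"))
    print(classify_gzip_body(b"<html>plain error page</html>"))
```

Appending plain bytes to a valid stream, as in the second call, is only one way such a body could arise; the log does not establish exactly what hhvm emitted in this incident.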
[18:08:40] bblack: I havne't been paying attention here, is everything OK or should we halt SWAT and the train? [18:08:58] * greg-g was in a meeting then looking at other train blockers [18:09:05] something's not ok, but it seems to be low-rate, and seems to be for only some pages for some logged-in centralauth sessions? [18:09:24] it may have already been present since ~yesterday [18:09:27] ... [18:09:40] so wmf.21? or is it on wikipedias/commons too? [18:09:56] no idea, I'm just working with 1x repro from James_F, on mediawikiwiki [18:10:02] I think there are some central auth errors in #mediawiki-core [18:10:33] bot two other people are complaining of strange problems, also on mediawikiwiki [18:10:54] (03CR) 10Andrew Bogott: [C: 032] Added role::labs::sentry [puppet] - 10https://gerrit.wikimedia.org/r/314310 (https://phabricator.wikimedia.org/T147233) (owner: 10Andrew Bogott) [18:11:04] which is ... group0 right? [18:11:08] Yeah [18:11:26] lets see if there are any recent changes to centralauth [18:11:28] so maybe hold the intended upcoming train of group1 to the same rev? [18:11:39] until we figure out what this is [18:12:44] bblack: k (cc thcipriani train also held on this weird issue bblack is investigating) [18:12:56] also oauth related [18:13:00] * thcipriani nods [18:13:11] I'm not really sure of the scope or where this started, I picked up the thread of this one case partway through from a ping and have been digging around heh [18:13:12] there's a non-blocking bug reported at https://phabricator.wikimedia.org/T147414 [18:13:33] (related to oauth) [18:13:35] Fatals are blocking, gah.... 
[18:13:43] <|L> https://www.mediawiki.org/wiki/How_to_become_a_MediaWiki_hacker => 503 [18:13:45] (I didn't realize that was a fatal earlier) [18:13:53] I'm solving the TMH fatal now [18:14:20] yeah it's not obvious that it's a fatal because hhvm turns the 500 Internal Server Error showing the fatal into corrupt gzip output, so then varnish just returns a generic 503 to the user [18:15:18] There's a second fatal in TMH that seems more likely than the OAuth one [18:15:23] Patch is landing in master now [18:16:35] https://www.mediawiki.org/wiki/How_to_become_a_MediaWiki_hacker is fatal for me while logged in heh [18:16:54] https://www.mediawiki.org/wiki/HyperSwitch is not [18:16:59] Lemme land this patch and see what it does ;-) [18:17:02] I have a suspicion! [18:17:16] you are suspicious [18:17:19] strange that https://www.mediawiki.org/wiki/How_to_become_a_MediaWiki_hacker?action=edit works though [18:17:41] !log T146211: Performing rolling restart of RESTBase rack 'b' Cassandra instances, and marking SSTables unrepaired. [18:17:43] T146211: Cluster-wide major compactions: parsoid.data-parsoid table - https://phabricator.wikimedia.org/T146211 [18:17:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:18:56] IMHO, if it's possible, we should just disable gzip output in hhvm globally [18:19:20] varnish will still compress the responses on their way into the cache and out to the users reliably, and the internal network bandwidth savings aren't worth the corruption headaches [18:19:44] Do we have gzip turned on in MW? [18:20:06] yess [18:20:07] yes* [18:20:22] Easy enough to toggle off :) [18:20:57] But strange that it started to be a problem today. When did we switch gzip on? [18:21:02] I think hhvm is doing the gzip output encoding rather than php code itself, right? 
[18:21:23] paladox: the gzip thing has been a background-level issue for a long time now, we just tend to notice it when it corrupts an internal server error response [18:21:37] Oh [18:21:44] https://phabricator.wikimedia.org/T125938 [18:21:46] bblack: HHVM should be doing what MW tells it to do I'd think? [18:21:52] beats me :) [18:21:55] heh [18:22:23] Do you know what the php error is, or if we switch gzip off it will tell us the error? :) [18:22:35] The error on https://www.mediawiki.org/wiki/How_to_become_a_MediaWiki_hacker seems to be https://phabricator.wikimedia.org/P4164 [18:22:47] RECOVERY - puppet last run on mc1016 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures [18:23:18] what's interesting is even the old repros on this (the hhvm/gzip bug) are also apparently related to CentralAuth [18:23:19] Oh tmh was just fixed [18:23:30] (03PS1) 10Chad: Disable $wgUseGzip [mediawiki-config] - 10https://gerrit.wikimedia.org/r/314317 [18:23:35] ostriches ^^ [18:23:41] perhaps it has a particular way of failing spectacularly in the general case (as in, interrupting already-started gzipped html output) [18:25:29] yeah so that all lines up, the TMH stack trace in P4164 and the output I got from the James_F repro earlier saying "Cannot access empty property" [18:25:43] !log demon@tin Synchronized php-1.28.0-wmf.21/extensions/TimedMediaHandler/: fix fatal (duration: 00m 54s) [18:25:44] is the TMH fix fully deployed? [18:25:47] oh there we go [18:25:48] Is now [18:25:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:25:58] https://www.mediawiki.org/wiki/How_to_become_a_MediaWiki_hacker suddenly works! [18:26:09] Yay I win the cookies! [18:26:13] Aha. [18:26:24] I no longer get the 503 on https://www.mediawiki.org/wiki/Phabricator [18:26:27] I guess it was only affecting logged-in views of pages with multimedia (as in video/audio) content? 
[18:26:31] Or https://www.mediawiki.org/w/index.php?title=Phabricator&diff=0&oldid=2241588 [18:26:33] Video content [18:26:36] Most likely [18:26:37] Yeah, probably. [18:27:11] Same thing with tonythomas's example of https://www.mediawiki.org/wiki/How_to_become_a_MediaWiki_hacker [18:27:12] So, there's this wonderful dashboard in logstash we made a bit ago called "group0" [18:27:21] It's basically the MW error logs, but only group0 wikis. [18:27:26] Fatals there are *really bad* [18:27:37] Because when you go to group[12], they magnify! [18:27:41] ostriches: let me update the old ticket on this, and then let's link the wgUseGzip thing to that about avoiding corruption [18:27:53] James_F: both of them are protected pages (btw) [18:28:03] Interesting. [18:30:36] (03PS2) 10Anomie: Use logstash's prune filter for api-feature-usage-sanitized [puppet] - 10https://gerrit.wikimedia.org/r/313035 [18:32:03] 06Operations, 10Traffic, 07Beta-Cluster-reproducible: PHP fatal errors causing Varnish to return 503 - "Junk after gzip data" - https://phabricator.wikimedia.org/T125938#2694266 (10BBlack) We saw this again today. There was a bug in TimedMediaHandler causing a `500 Internal Server Error` only for (at least... [18:32:21] ostriches: can you link ^ to the $wgUseGzip commit? [18:33:54] (03PS2) 10Chad: Disable $wgUseGzip [mediawiki-config] - 10https://gerrit.wikimedia.org/r/314317 (https://phabricator.wikimedia.org/T125938) [18:34:28] (03CR) 10BBlack: [C: 031] Disable $wgUseGzip [mediawiki-config] - 10https://gerrit.wikimedia.org/r/314317 (https://phabricator.wikimedia.org/T125938) (owner: 10Chad) [18:35:10] I think with TMH fixed if nobody else knows of other outstanding stuff, no reason to keep holding the upcoming train on the hour, right? 
[18:36:11] bblack: sadly, this one cropped back up: https://phabricator.wikimedia.org/T147359 :/ [18:36:41] (03PS3) 10Chad: Disable $wgUseGzip [mediawiki-config] - 10https://gerrit.wikimedia.org/r/314317 (https://phabricator.wikimedia.org/T125938) [18:37:27] (03CR) 10Chad: [C: 032] Disable $wgUseGzip [mediawiki-config] - 10https://gerrit.wikimedia.org/r/314317 (https://phabricator.wikimedia.org/T125938) (owner: 10Chad) [18:37:54] (03Merged) 10jenkins-bot: Disable $wgUseGzip [mediawiki-config] - 10https://gerrit.wikimedia.org/r/314317 (https://phabricator.wikimedia.org/T125938) (owner: 10Chad) [18:38:14] I am getting a fatal error („DBQueryError“) while suppressing an edit on Commonswiki. already known? [18:39:11] (03PS1) 10Andrew Bogott: Include memcached in role::simplelamp [puppet] - 10https://gerrit.wikimedia.org/r/314318 (https://phabricator.wikimedia.org/T147233) [18:39:11] !log demon@tin Synchronized wmf-config/CommonSettings.php: disable gzip internally, T125938 (duration: 00m 50s) [18:39:13] T125938: PHP fatal errors causing Varnish to return 503 - "Junk after gzip data" - https://phabricator.wikimedia.org/T125938 [18:39:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:39:36] bblack: That *should* make that problem go away for good now [18:39:51] Unless hhvm is overriding MW and doing its own gzip ;-) [18:40:32] (03CR) 10Andrew Bogott: [C: 032] Include memcached in role::simplelamp [puppet] - 10https://gerrit.wikimedia.org/r/314318 (https://phabricator.wikimedia.org/T147233) (owner: 10Andrew Bogott) [18:44:22] Raymond_: what's the full error message? [18:44:47] greg-g: [V-VKDgpAMEcAAGDXoGgAAAAE] 2016-10-05 18:44:30: Fataler Ausnahmefehler des Typs „DBQueryError“ [18:46:36] !log T146211: Performing rolling restart of RESTBase eqiad rack 'd' Cassandra instances, and marking SSTables unrepaired. 
[18:46:38] T146211: Cluster-wide major compactions: parsoid.data-parsoid table - https://phabricator.wikimedia.org/T146211 [18:46:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:48:34] 07Puppet, 06Labs, 10Phabricator: Phabricator labs puppet role configures phabricator wrong - https://phabricator.wikimedia.org/T131899#2694339 (10mmodell) a:03mmodell [18:49:06] greg-g, it's T147113 - I thought it's fixed? :P [18:49:21] 07Puppet, 06Labs, 10Phabricator: Phabricator labs puppet role configures phabricator wrong - https://phabricator.wikimedia.org/T131899#2182293 (10mmodell) I'm doing some work on the labs role in https://gerrit.wikimedia.org/r/#/c/313937/ [18:50:08] MaxSem: sigh.... [18:50:39] commonswiki is group1, not 0, so it might be fixed by upgrading? :) [18:50:52] ostriches: assuming that's fully rolled out, I think hhvm must have its own separate setting. I can still do a curl against appservers.svc with Accept-Encoding: gzip and hhvm outputs Content-Encoding: gzip [18:51:02] wasn't schema change the fix? [18:51:21] bblack: Perhaps. Now, where's *that* config? :) [18:51:36] ostriches: oh right.... [18:54:17] bblack: Or perhaps apache? [18:54:25] ostriches: it's possible it's apache too, yeah [18:54:37] hhvm's ini config comes from e.g. hieradata/role/common/mediawiki/appserver.yaml [18:54:48] (in puppet repo, with keys like hieradata/role/common/mediawiki/appserver.yaml [18:54:51] ugh [18:54:54] We use mod_deflate on js/css. [18:55:01] with keys like hhvm::extra::fcgi [18:55:48] (03PS1) 10Dzahn: planet: fix/remove 3 remaining feeds with "no data" [puppet] - 10https://gerrit.wikimedia.org/r/314325 (https://phabricator.wikimedia.org/T134437) [18:56:03] there is a documented hhvm ini option: hhvm.server.gzip_compression_level [18:56:08] not sure if "0" there disables, doesn't say [18:57:15] 0 probably will. 
[18:58:29] well or it could do gzip encoding with no actual compression, which is different [18:58:46] at the php.ini level (not specific to hhvm, but hhvm docs ref it): there's "zlib.output_compression Off" [18:59:14] http://php.net/manual/en/zlib.configuration.php#ini.zlib.output-compression [18:59:20] Yeah that should be off already [18:59:45] Which is what $wgUseGzip overrides at runtime. [19:00:04] thcipriani: Dear anthropoid, the time has come. Please deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161005T1900). [19:00:41] Wait. [19:00:42] I lied. [19:00:49] $wgUseGzip is useless to us [19:00:49] train on hold for https://phabricator.wikimedia.org/T147359 :(( [19:01:25] bblack: I'm sorry, it's been a long time since I looked at this. [19:01:31] gzip ain't coming from that setting [19:01:36] ah ok [19:02:51] (03PS1) 10Chad: Remove $wgUseGzip entirely, we don't use it at all [mediawiki-config] - 10https://gerrit.wikimedia.org/r/314327 [19:03:08] "It's for file cache, which hasn't been used at WMF since like 1802 BC or so" [19:03:39] (03CR) 10Chad: [C: 032] Remove $wgUseGzip entirely, we don't use it at all [mediawiki-config] - 10https://gerrit.wikimedia.org/r/314327 (owner: 10Chad) [19:03:42] thcipriani is group 0 on wmf21 [19:03:45] ? [19:04:01] audephone: yes, wmf.21 is still on group0 wikis [19:04:04] (03Merged) 10jenkins-bot: Remove $wgUseGzip entirely, we don't use it at all [mediawiki-config] - 10https://gerrit.wikimedia.org/r/314327 (owner: 10Chad) [19:04:08] Ok [19:04:16] bblack: Anyway, back to hhvm. "When compression with gzip, this is the level of compression that will be used. 1 is fastest. 9 is best." for hhvm.server.gzip_compression_level. 
So...who knows [19:04:22] I was checking test wikidata earlier [19:04:34] (03PS2) 10Dzahn: planet: fix/remove 3 remaining feeds with "no data" [puppet] - 10https://gerrit.wikimedia.org/r/314325 (https://phabricator.wikimedia.org/T134437) [19:05:08] For new wikidata code (seemed ok) [19:05:23] Plus selenium tests [19:05:39] !log demon@tin Synchronized wmf-config/CommonSettings.php: remove dumb commented setting, dumb me (duration: 00m 49s) [19:05:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:06:05] (03CR) 10Dzahn: [C: 032] planet: fix/remove 3 remaining feeds with "no data" [puppet] - 10https://gerrit.wikimedia.org/r/314325 (https://phabricator.wikimedia.org/T134437) (owner: 10Dzahn) [19:06:53] 06Operations, 10Traffic, 07Beta-Cluster-reproducible, 13Patch-For-Review: PHP fatal errors causing Varnish to return 503 - "Junk after gzip data" - https://phabricator.wikimedia.org/T125938#2694407 (10BBlack) ^ The above turns out to be confusingly-named but unrelated. We still haven't quite figured out h... 
[19:10:09] (03CR) 10Arseny1992: [C: 031] "Local community doesn't seem to object as two months have passed" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/304202 (https://phabricator.wikimedia.org/T142632) (owner: 10Dereckson) [19:10:31] (03PS2) 10BBlack: Text VCL: remove synth side of Win+Chrome/41 workaround [puppet] - 10https://gerrit.wikimedia.org/r/313828 (https://phabricator.wikimedia.org/T141786) [19:10:36] (03CR) 10BBlack: [C: 032 V: 032] Text VCL: remove synth side of Win+Chrome/41 workaround [puppet] - 10https://gerrit.wikimedia.org/r/313828 (https://phabricator.wikimedia.org/T141786) (owner: 10BBlack) [19:10:50] (03PS2) 10Dzahn: ldap: migrate role classes to autoloader layout [puppet] - 10https://gerrit.wikimedia.org/r/308314 (https://phabricator.wikimedia.org/T93645) (owner: 10Hashar) [19:11:06] (03PS3) 10Dzahn: ldap: migrate role classes to autoloader layout [puppet] - 10https://gerrit.wikimedia.org/r/308314 (https://phabricator.wikimedia.org/T93645) (owner: 10Hashar) [19:11:20] (03CR) 10Dzahn: "needed manual rebase" [puppet] - 10https://gerrit.wikimedia.org/r/308314 (https://phabricator.wikimedia.org/T93645) (owner: 10Hashar) [19:12:13] 06Operations, 10Traffic, 13Patch-For-Review: TLS stats regression related to Chrome/41 on Windows - https://phabricator.wikimedia.org/T141786#2694435 (10BBlack) 05Open>03Resolved a:03BBlack So far there doesn't seem to be any recurrence of the stats anomaly when removing the workaround. Closing for no... 
[19:13:50] (03PS1) 10Andrew Bogott: Add base::firewall and mediawiki::conftool into role::beta::mediawiki [puppet] - 10https://gerrit.wikimedia.org/r/314328 (https://phabricator.wikimedia.org/T147233) [19:14:59] !log rebooting maps1* for kernel upgrade [19:15:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:15:41] (03PS2) 10Andrew Bogott: Add base::firewall and mediawiki::conftool into role::beta::mediawiki [puppet] - 10https://gerrit.wikimedia.org/r/314328 (https://phabricator.wikimedia.org/T147233) [19:16:24] (03CR) 10Dzahn: [C: 032] "the class names stay the same, the content is untouched, this is only moving the files in the repo, so nothing should be impacted at all" [puppet] - 10https://gerrit.wikimedia.org/r/308314 (https://phabricator.wikimedia.org/T93645) (owner: 10Hashar) [19:18:47] ostriches: I think I remember ori turning off our userspace gzip quite a while ago and only relying on hhvm to do it, but I haven't looked for config patches to prove that [19:18:54] (03PS2) 10Dzahn: quarry: migrate classes to autoloader layout [puppet] - 10https://gerrit.wikimedia.org/r/308313 (https://phabricator.wikimedia.org/T93645) (owner: 10Hashar) [19:20:13] 06Operations, 06Performance-Team, 10Thumbor: Some video files not recognized - https://phabricator.wikimedia.org/T147417#2694525 (10Gilles) [19:21:55] not finding obvious proof for my claim [19:22:38] userspace config of it I don't see any evidence of :) [19:22:45] I don't see hhvm config for it either! [19:24:32] looks like it was the other way around. disabled hhvm gzip in 4bdfea0 -- https://github.com/wikimedia/operations-puppet/commit/4bdfea0 [19:24:40] I wonder if that got lost at some point? 
[19:24:48] 06Operations, 06Labs, 13Patch-For-Review: Set up monitoring for secondary labstore HA cluster - https://phabricator.wikimedia.org/T144633#2694538 (10madhuvishy) a:03madhuvishy [19:25:14] that setting is still in ::hhvm [19:26:13] 06Operations, 06Performance-Team, 10Thumbor: Some video files not recognized - https://phabricator.wikimedia.org/T147417#2694561 (10Gilles) [19:28:49] and it looks like that 0 should disable gzip in hhvm itself -- https://github.com/facebook/hhvm/blob/10b4a1a/hphp/runtime/server/transport.cpp#L994-L995 [19:29:56] (03CR) 10Thcipriani: [C: 031] "After reviewing hosts that use role::beta::mediawiki but do *not* have base::firewall, it seems that there are no unique ports being used " [puppet] - 10https://gerrit.wikimedia.org/r/314328 (https://phabricator.wikimedia.org/T147233) (owner: 10Andrew Bogott) [19:31:55] bd808: curiouser.... [19:32:49] (03PS3) 10Andrew Bogott: Add base::firewall to role::beta::mediawiki [puppet] - 10https://gerrit.wikimedia.org/r/314328 (https://phabricator.wikimedia.org/T147233) [19:34:10] (03PS4) 10Andrew Bogott: Add base::firewall to role::beta::mediawiki [puppet] - 10https://gerrit.wikimedia.org/r/314328 (https://phabricator.wikimedia.org/T147233) [19:34:16] maybe it's apache doing it after all [19:34:25] (on html dynamic html content that is) [19:35:52] (03PS1) 10BBlack: remove old varnish geoip test [puppet] - 10https://gerrit.wikimedia.org/r/314334 (https://phabricator.wikimedia.org/T107430) [19:35:54] (03PS1) 10BBlack: remove various pointless "bits" references [puppet] - 10https://gerrit.wikimedia.org/r/314335 (https://phabricator.wikimedia.org/T107430) [19:35:56] (03PS1) 10BBlack: vk::webrequest - adjust peak rate estimates [puppet] - 10https://gerrit.wikimedia.org/r/314336 [19:36:45] (03CR) 10Andrew Bogott: [C: 032] Add base::firewall to role::beta::mediawiki [puppet] - 10https://gerrit.wikimedia.org/r/314328 (https://phabricator.wikimedia.org/T147233) (owner: 10Andrew Bogott) [19:37:26] !log 
deploy RESTBase 810b6aa563 canary on restbase1007 [19:37:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:38:22] (03CR) 10Dzahn: [C: 032] quarry: migrate classes to autoloader layout [puppet] - 10https://gerrit.wikimedia.org/r/308313 (https://phabricator.wikimedia.org/T93645) (owner: 10Hashar) [19:38:27] (03PS3) 10Dzahn: quarry: migrate classes to autoloader layout [puppet] - 10https://gerrit.wikimedia.org/r/308313 (https://phabricator.wikimedia.org/T93645) (owner: 10Hashar) [19:38:58] (03CR) 10Dzahn: "checked with watroles tool where these are used. quarry-main-01 and quarry-runner-01 will cover them all" [puppet] - 10https://gerrit.wikimedia.org/r/308313 (https://phabricator.wikimedia.org/T93645) (owner: 10Hashar) [19:39:22] mutante: awesome :) [19:39:33] mutante: the tedious part is really to verify the impact :( [19:39:42] (03PS2) 10BBlack: remove old varnish geoip test [puppet] - 10https://gerrit.wikimedia.org/r/314334 (https://phabricator.wikimedia.org/T107430) [19:39:47] (03CR) 10BBlack: [C: 032 V: 032] remove old varnish geoip test [puppet] - 10https://gerrit.wikimedia.org/r/314334 (https://phabricator.wikimedia.org/T107430) (owner: 10BBlack) [19:39:56] (03PS2) 10BBlack: remove various pointless "bits" references [puppet] - 10https://gerrit.wikimedia.org/r/314335 (https://phabricator.wikimedia.org/T107430) [19:40:00] (03CR) 10BBlack: [C: 032 V: 032] remove various pointless "bits" references [puppet] - 10https://gerrit.wikimedia.org/r/314335 (https://phabricator.wikimedia.org/T107430) (owner: 10BBlack) [19:40:15] bblack: I can't find anything in apache that's doing it though [19:40:32] Just mod_deflate, but that shouldn't be doing it on HTML [19:40:35] DeflateCompressionLevel 9 [19:40:35] AddOutputFilterByType DEFLATE text/css text/javascript application/x-javascript [19:40:41] hashar: yes, indeed. watroles ftw though, when i dont forget the right "syntax". 
https://tools.wmflabs.org/watroles/role/role::labs::quarry::celeryrunner [19:41:28] (03PS4) 10Dzahn: quarry: migrate classes to autoloader layout [puppet] - 10https://gerrit.wikimedia.org/r/308313 (https://phabricator.wikimedia.org/T93645) (owner: 10Hashar) [19:42:25] hashar: re: gallium->contint1001, i can do the root steps, only part i was wondering is what time of day [19:42:37] hashar: and if there is anything left now that can be done before the date [19:43:08] hashar: and finally.. the LDAP role change is merged and had zero impact [19:43:16] \O/ [19:44:21] uhm.. i wanted to confirm the quarry change is a no-op.but [19:44:27] it's disabled on quarry-main-01 [19:45:06] but that must be no-op [19:45:19] the class names do not change [19:45:22] 06Operations, 06Reading-Infrastructure-Team, 06Services, 06Services-next, 07Security-General: Protect sensitive user-related information with a UserData / auth / session service - https://phabricator.wikimedia.org/T140813#2694660 (10GWicke) p:05High>03Normal [19:46:00] on quarry-runner-01 there is an unrelated problem with exim4 being started [19:46:16] PROBLEM - puppet last run on cp3035 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/sbin/varnish-test-geoip] [19:47:13] 06Operations, 06Discovery-Search, 07Wikimedia-Incident: Enable GC (garbage collection) logs on Elasticsearch JVM - https://phabricator.wikimedia.org/T134853#2694669 (10debt) Let's go ahead and start working on this next. [19:48:16] PROBLEM - puppet last run on cp1072 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/sbin/varnish-test-geoip] [19:48:35] PROBLEM - puppet last run on cp3034 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 minutes ago with 1 failures. 
Failed resources (up to 3 shown): File[/usr/local/sbin/varnish-test-geoip] [19:49:30] bleh [19:49:32] 06Operations, 06Discovery, 06Discovery-Search, 10Elasticsearch, 13Patch-For-Review: Increase time before alert for elasticsearch disk space issues - https://phabricator.wikimedia.org/T136702#2694676 (10debt) Let's see if we can some agreement on how this will be done. [19:49:48] I think that's a race condition on the puppetmasters, the cp* puppetfails [19:50:05] (removing a fileserver file in the same commit as removing the related File resource) [19:50:10] should self-resolve on next runs [19:52:09] (03CR) 10Dzahn: "on quarry-runner-01: no-op but unrelated issue with exim4 startup" [puppet] - 10https://gerrit.wikimedia.org/r/308313 (https://phabricator.wikimedia.org/T93645) (owner: 10Hashar) [19:52:36] !log jessie dist-upgrade on secondary LVS servers [19:52:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:56:19] !log deploy RESTBase 810b6aa563 [19:56:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:57:29] (03PS1) 10Andrew Bogott: Add base::firewall to role::eventbus::eventbus [puppet] - 10https://gerrit.wikimedia.org/r/314338 [20:00:04] gwicke, cscott, arlolra, subbu, bearND, mdholloway, halfak, Amir1, and yurik: Respected human, time to deploy Services – Parsoid / OCG / Citoid / Mobileapps / ORES / … (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161005T2000). Please do the needful. [20:00:18] no parsoid deploy today. 
[20:00:22] (03CR) 10Alex Monk: [C: 031] Migrate mediawiki-firejail-convert to mediawiki-converters.profile [puppet] - 10https://gerrit.wikimedia.org/r/314233 (https://phabricator.wikimedia.org/T145811) (owner: 10Muehlenhoff) [20:00:43] nothing for ORES today [20:00:46] no mobileapps deploy today [20:01:26] https://www.youtube.com/watch?v=wIWZFXLpXSA [20:03:08] (03CR) 10Alex Monk: [C: 04-1] openstack: skip DNS update for contintcloud (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/314188 (owner: 10Hashar) [20:05:05] (03PS4) 10Rush: labstore: Add monitoring for secondary HA cluster health [puppet] - 10https://gerrit.wikimedia.org/r/311723 (https://phabricator.wikimedia.org/T144633) (owner: 10Madhuvishy) [20:12:04] RECOVERY - puppet last run on cp1072 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:12:26] RECOVERY - puppet last run on cp3034 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:12:44] RECOVERY - puppet last run on cp3035 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [20:14:10] 06Operations, 06Performance-Team, 10Thumbor: thumbor: Some video files not recognized - https://phabricator.wikimedia.org/T147417#2694866 (10Dzahn) [20:18:35] 07Blocked-on-Operations, 10DBA, 06Labs, 10Labs-Infrastructure: maintain-replicas.pl unmaintained, unmaintainable - https://phabricator.wikimedia.org/T138450#2694905 (10AlexMonk-WMF) a:05AlexMonk-WMF>03chasemp Chase is working on figuring out what else we need to do before we can run the script. https:/... 
[20:19:40] (03CR) 10Alex Monk: "Then please merge it, Ariel" [puppet] - 10https://gerrit.wikimedia.org/r/309709 (https://phabricator.wikimedia.org/T123607) (owner: 10Alex Monk) [20:20:40] 06Operations, 06Performance-Team, 10Thumbor: Thumbor times out on large files sometimes - https://phabricator.wikimedia.org/T147412#2694916 (10Gilles) Since we increased the HTTP_LOADER_REQUEST_TIMEOUT value from 2) to 60 seconds this issue only happened once in 4ish hours, on a 413MB PDF: ``` Oct 5 16:11:... [20:34:07] !log ran package updates on wikitech-static vm [20:34:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:43:07] 06Operations, 06Discovery, 06Maps, 03Interactive-Sprint: reimage maps-test* servers - https://phabricator.wikimedia.org/T147194#2695027 (10Gehel) p:05Triage>03High [20:51:24] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 715 600 - REDIS on 10.192.48.44:6479 has 1 databases (db0) with 3077022 keys - replication_delay is 715 [20:58:49] (03PS3) 10Dzahn: toollabs: Fix variable contains an uppercase letter [puppet] - 10https://gerrit.wikimedia.org/r/312284 (owner: 10Paladox) [20:59:11] (03CR) 10Dzahn: [C: 032] toollabs: Fix variable contains an uppercase letter [puppet] - 10https://gerrit.wikimedia.org/r/312284 (owner: 10Paladox) [20:59:18] Thanks ^^ [21:03:58] (03CR) 10Dzahn: "double checked on tools-puppetmaster-01 no change in /etc/clustershell/groups.conf. 
only this instance -> https://tools.wmflabs.org/watr" [puppet] - 10https://gerrit.wikimedia.org/r/312284 (owner: 10Paladox) [21:08:15] !log thcipriani@tin Synchronized php-1.28.0-wmf.21/maintenance/lag.php: [[gerrit:314414|Make LoadMonitor use $serverIndexes in the cache key (T147359)]] PART I (duration: 00m 50s) [21:08:16] T147359: Cannot access the database: No working replica DB server: Unknown error - https://phabricator.wikimedia.org/T147359 [21:08:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:09:20] !log thcipriani@tin Synchronized php-1.28.0-wmf.21/includes/libs/rdbms: [[gerrit:314414|Make LoadMonitor use $serverIndexes in the cache key (T147359)]] PART II (duration: 00m 55s) [21:09:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:12:29] ^ since grrrit-wm died I just created a patch for the group1 switch to wmf.21 https://gerrit.wikimedia.org/r/#/c/314418/ [21:14:50] !log thcipriani@tin rebuilt wikiversions.php and synchronized wikiversions files: group1 wikis to 1.28.0-wmf.21 [21:14:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:14:59] ^ group1 now running wmf.21 [21:18:49] lots of error log spam: Notice: Use of undefined constant NS_TIMEDTEXT [21:19:47] and: LoadBalancer::reuseConnection: connection not found [21:20:12] That would be tmh [21:20:14] again [21:20:19] for NS_TIMEDTEXT [21:20:23] brion thedj ^^ [21:20:36] grr [21:20:55] AaronSchulz: ^^ re LoadBalancer log spam [21:21:02] :/ [21:21:06] thcipriani: not sure about that LoadBalancer but the TIMEDTEXT sounds like tmh [21:21:22] assumed 'NS_TIMEDTEXT' in /srv/mediawiki/php-1.28.0-wmf.21/extensions/TimedMediaHandler/handlers/TextHandler/TextHandler.php on line 296 [21:21:39] TMH has a hook to define it, lemme see why it's not working [21:22:38] Maybe related to the tmh timedtext update we did [21:23:53] huh [21:23:59] there is no call to CanonicalNamespaces hook that i 
can see in core [21:24:08] no wait i see it [21:24:34] ah i bet that's wrong place to do it [21:25:58] yeah i see [21:26:07] we don't have $wgEnableLocalTimedText on the other wikis [21:26:14] so it's not getting defined [21:26:16] i can work around thsi [21:28:06] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [50.0] [21:28:35] hmmmm [21:28:35] thcipriani: revert again? looks like brion is on TMH, but not sure about the LB one :/ [21:28:54] yeah, sounds like the right thing. [21:30:10] !log thcipriani@tin rebuilt wikiversions.php and synchronized wikiversions files: group1 to 1.28.0-wmf.20 [21:30:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:30:28] error rate is 10x what it was, constant is only about 1/2 of that :( [21:31:18] (03PS1) 10Thcipriani: Revert "group1 wikis to 1.28.0-wmf.21" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/314427 [21:31:32] thcipriani: https://gerrit.wikimedia.org/r/#/c/314426/ on master should clear up the NS_TIMEDTEXT [21:31:45] (03CR) 10Thcipriani: [C: 032] Revert "group1 wikis to 1.28.0-wmf.21" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/314427 (owner: 10Thcipriani) [21:31:48] but you know what [21:31:59] i'm going to revert the change pending more testing, i just don't trust it [21:32:11] i need better test cases for Commons-like local setup [21:32:17] brion: okie doke :) [21:32:19] (03Merged) 10jenkins-bot: Revert "group1 wikis to 1.28.0-wmf.21" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/314427 (owner: 10Thcipriani) [21:32:32] group1 is back on wmf.20 now [21:33:29] What's the problem? [21:34:09] deployment blockers again by that rate we'd never have stuff working lol [21:34:32] "and: LoadBalancer::reuseConnection: connection not found" [21:35:03] thciprian*i ^^ said that [21:35:04] audephone: error rate spike. 
TimedMediaHandler had an undefined constant and a ton of "LoadBalancer::reuseConnection: connection not found, has the connection been freed already?" [21:35:29] Wonder if it's anything related to Wikidata? [21:38:43] i'm in process of reverting the TMH bit, since we found two failures in it i want more testing [21:40:45] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0] [21:44:16] thcipriani: ok https://gerrit.wikimedia.org/r/#/c/314433/ on branch will revert the recent TMH changes that introduced the wrong property access and bad constant usage (and possibly other bugs) [21:44:31] i'll re-land it after more testing [21:45:23] brion: kk, thanks! [21:45:52] 06Operations, 10Graphite, 13Patch-For-Review, 15User-Addshore: jobrunner should send statsd in batches - https://phabricator.wikimedia.org/T132327#2194525 (10Addshore) Is this one deployed yet? [21:48:05] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [50.0] [21:52:26] 06Operations, 07Puppet, 10Beta-Cluster-Infrastructure: Make deployment-prep puppetmaster more similar to Production puppetmaster - https://phabricator.wikimedia.org/T146627#2695230 (10AlexMonk-WMF) Will setting up puppetdb fix T72792? Was anything brought up during the ops offsite that should be added to th... 
[21:59:13] !log thcipriani@tin Synchronized php-1.28.0-wmf.21/extensions/TimedMediaHandler: [[gerrit:314433|Revert "Rewrite discovery of TimedText tracks"]] (duration: 00m 54s) [21:59:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:59:29] ^ brion sync'd your TMH revert, FYI [22:00:07] tx [22:05:18] thcipriani: https://gerrit.wikimedia.org/r/#/c/314439/ more debugging info [22:07:53] AaronSchulz: okie doke +2'd will backport [22:08:14] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0] [22:10:35] (03PS1) 10Andrew Bogott: Add $use_ssl switch to role::nova::proxy [puppet] - 10https://gerrit.wikimedia.org/r/314441 [22:11:32] (03CR) 10Andrew Bogott: "@krenair: My hope is that with this change we can use this role for labs-dynamicproxy-test rather than including the dynamicproxy classes" [puppet] - 10https://gerrit.wikimedia.org/r/314441 (owner: 10Andrew Bogott) [22:15:31] !log rebooting secondary (inactive) LVS hosts for kernel updates [22:15:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:24:59] !log thcipriani@tin Synchronized php-1.28.0-wmf.21/includes/libs/rdbms/loadbalancer/LoadBalancer.php: [[gerrit:314440|Add more information to reuseConnection() exceptions]] (duration: 00m 51s) [22:25:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:25:07] ^ AaronSchulz debug info is live [22:30:00] ok [22:35:13] !log jessie dist-upgrade on primary LVS servers [22:35:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:38:39] (03CR) 10Alex Monk: [C: 04-1] "Sounds good, but I think this particular commit has a problem" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/314441 (owner: 10Andrew Bogott) [22:41:05] PROBLEM - DPKG on lvs1003 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [22:42:45] PROBLEM - puppet last run on lvs2001 
is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[openssh-client] [22:43:35] RECOVERY - DPKG on lvs1003 is OK: All packages OK [22:43:47] (03CR) 10Mattflaschen: [C: 04-1] Always set wgFlowDefaultWikiDb (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/314194 (owner: 10Dereckson) [22:45:45] (03PS1) 10Yurik: Added Kartotherian allowedDomains list [puppet] - 10https://gerrit.wikimedia.org/r/314448 (https://phabricator.wikimedia.org/T147529) [22:46:20] (03PS5) 10Madhuvishy: labstore: Add monitoring for secondary HA cluster health [puppet] - 10https://gerrit.wikimedia.org/r/311723 (https://phabricator.wikimedia.org/T144633) [22:47:22] (03CR) 10jenkins-bot: [V: 04-1] labstore: Add monitoring for secondary HA cluster health [puppet] - 10https://gerrit.wikimedia.org/r/311723 (https://phabricator.wikimedia.org/T144633) (owner: 10Madhuvishy) [22:47:45] RECOVERY - puppet last run on lvs2001 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [22:49:31] !log rebooting primary LVS hosts for kernel updates [22:49:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:50:02] gehel, ^^^^ [22:51:11] (03CR) 10Gehel: [C: 032] Added Kartotherian allowedDomains list [puppet] - 10https://gerrit.wikimedia.org/r/314448 (https://phabricator.wikimedia.org/T147529) (owner: 10Yurik) [22:51:43] (03PS6) 10Madhuvishy: labstore: Add monitoring for secondary HA cluster health [puppet] - 10https://gerrit.wikimedia.org/r/311723 (https://phabricator.wikimedia.org/T144633) [22:58:14] 06Operations, 10Deployment-Systems, 06Services, 10service-runner, 15User-mobrovac: Automate compiling service dependencies using production Jessie libraries - https://phabricator.wikimedia.org/T94611#2695447 (10mobrovac) 05Open>03Resolved a:03mobrovac The established practice is > - create a WMF J... 
[22:58:34] 06Operations, 10Parsoid, 06Services, 10service-runner, 15User-mobrovac: Decide whether to install heapdump by default, or continue to install npm & install on demand - https://phabricator.wikimedia.org/T95431#2695451 (10mobrovac) 05Open>03Resolved a:03mobrovac We have been going down the second rou... [22:58:38] (03CR) 10Dzahn: "is the hhvm change related to aptly and nginx?" [puppet] - 10https://gerrit.wikimedia.org/r/312562 (owner: 1020after4) [22:59:15] RECOVERY - NTP on eeden is OK: NTP OK: Offset 0.001153230667 secs [22:59:20] (03PS2) 10Dereckson: Always set wgFlowDefaultWikiDb [mediawiki-config] - 10https://gerrit.wikimedia.org/r/314194 [22:59:30] (03CR) 10Dereckson: Always set wgFlowDefaultWikiDb (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/314194 (owner: 10Dereckson) [22:59:57] (03CR) 1020after4: "dzahn: yes it was a required change because both manifests were declaring the same package" [puppet] - 10https://gerrit.wikimedia.org/r/312562 (owner: 1020after4) [23:00:04] addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Respected human, time to deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161005T2300). Please do the needful. [23:00:04] Dereckson and ebernhardson: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [23:00:07] scap has a manpage now: mutante ^^ [23:00:12] Hello. I can SWAT this evening. [23:00:51] er, scap manpage is unrelated lol [23:01:07] mutante: tin has been reprovisioned? [23:01:31] (or bastion3001) [23:01:34] Dereckson: no, bastions [23:01:36] ok [23:01:37] Dereckson: bast3001 [23:01:37] 3001 definitely has [23:01:50] reedy@tin:~$ uptime [23:01:50] 23:01:29 up 89 days, 14:50, 3 users, load average: 0.00, 0.01, 0.05 [23:01:52] ;D [23:02:45] Thanks. 
[23:02:47] \o [23:03:16] (03PS2) 10Dereckson: Cirrus: Support document versioning [mediawiki-config] - 10https://gerrit.wikimedia.org/r/308896 (https://phabricator.wikimedia.org/T144039) (owner: 10EBernhardson) [23:03:41] ebernhardson: "Must not be deployed until the production and beta cluster search clusters have had a cluster restart with the new version of search-extra plugin supporting this change." [23:03:55] I imagine that has been done? [23:03:57] (03PS8) 10Dzahn: Make nginx optional in aptly class [puppet] - 10https://gerrit.wikimedia.org/r/312562 (owner: 1020after4) [23:03:59] Dereckson: that's why that patch is from sept 6th :) [23:04:01] Dereckson: yes [23:04:06] (03PS1) 10Dzahn: base: activate vlan reporting via LLDP [puppet] - 10https://gerrit.wikimedia.org/r/314450 (https://phabricator.wikimedia.org/T84518) [23:04:10] (03CR) 10Dereckson: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/308896 (https://phabricator.wikimedia.org/T144039) (owner: 10EBernhardson) [23:04:43] (03Merged) 10jenkins-bot: Cirrus: Support document versioning [mediawiki-config] - 10https://gerrit.wikimedia.org/r/308896 (https://phabricator.wikimedia.org/T144039) (owner: 10EBernhardson) [23:05:03] no user visible change, i'll keep an eye on logstash to see if jobrunner complain about anything [23:05:36] live on mw1099 [23:05:57] (03CR) 10Dzahn: "@Alex alright, thanks -> https://gerrit.wikimedia.org/r/#/c/314450/1" [puppet] - 10https://gerrit.wikimedia.org/r/313375 (https://phabricator.wikimedia.org/T84518) (owner: 10Dzahn) [23:06:00] wont see anything from mw1099, it only effects job runners doing cirrus update jobs [23:06:55] twentyafterfour: ok, i'll merge it soon, just ran out of time right now [23:06:59] twentyafterfour: "scap has a manpage now" -> not deployed everywhere [23:07:04] already got the list of labs instances to check [23:08:03] (03PS1) 10Yurik: Fixed Kartotherian allowedDomains list indent [puppet] - 10https://gerrit.wikimedia.org/r/314451 
(https://phabricator.wikimedia.org/T147529) [23:08:10] gehel, ^ [23:09:08] !log dereckson@tin Synchronized wmf-config/CirrusSearch-common.php: Cirrus: Support document versioning (T144039) (duration: 00m 50s) [23:09:08] (03CR) 10Gehel: [C: 032] Fixed Kartotherian allowedDomains list indent [puppet] - 10https://gerrit.wikimedia.org/r/314451 (https://phabricator.wikimedia.org/T147529) (owner: 10Yurik) [23:09:09] T144039: Elasticsearch document versioning doesn't work in CirrusSearch - https://phabricator.wikimedia.org/T144039 [23:09:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:09:31] (03PS2) 10Dereckson: Disable local upload on bat-smg.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/304202 (https://phabricator.wikimedia.org/T142632) [23:09:46] (03CR) 10Dereckson: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/304202 (https://phabricator.wikimedia.org/T142632) (owner: 10Dereckson) [23:09:52] Dereckson: Can I add a patch to the end of the SWAT? I only just +2ed it in master so it's still being Jenkinsed [23:10:05] RoanKattouw: sure [23:10:20] (03Merged) 10jenkins-bot: Disable local upload on bat-smg.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/304202 (https://phabricator.wikimedia.org/T142632) (owner: 10Dereckson) [23:10:23] Cool thanks, I'll send you the cherry-pick URL once I have it [23:11:46] PROBLEM - HHVM rendering on mw1213 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:12:33] yurik: some kind of maps deploy going on? [23:12:45] PROBLEM - Apache HTTP on mw1213 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:12:48] 304202 works fine in mw1099 [23:13:12] seen this for 1/3 checks (doesn't rise to IRC logging yet): PYBAL CRITICAL - kartotherian_6533 - Could not depool server maps1001.eqiad.wmnet because of too many down! 
[23:13:51] Dereckson: https://gerrit.wikimedia.org/r/#/c/314452/ , will add to the wiki page to [23:13:52] !log dereckson@tin Synchronized dblists/commonsuploads.dblist: Disable local upload on bat-smg.wikipedia (T142632) (duration: 00m 49s) [23:13:53] T142632: Restrict local uploads on bat-smg.wikipedia - https://phabricator.wikimedia.org/T142632 [23:13:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:14:08] it resolved itself... I guess temporary kartotherian outage? [23:14:20] matt_flaschen: I've updated wg = wg to wg = wmg for https://gerrit.wikimedia.org/r/#/c/314194/, looks good to you to deploy it and 314192? [23:15:30] bblack, minor config change [23:15:35] bblack, already done [23:16:15] bblack, i think gehel did a stagged eqiad deploy first (eqiad is not in prod) [23:16:36] RoanKattouw: will require a full scap, won't it? [23:17:14] (for localisation update) [23:17:16] ideally we should have a process around depooling for those kinds of things, though [23:17:17] Ugh [23:17:29] (03CR) 10Mattflaschen: [C: 031] Always set wgFlowDefaultWikiDb [mediawiki-config] - 10https://gerrit.wikimedia.org/r/314194 (owner: 10Dereckson) [23:17:34] but still, too many down == too many down :) [23:17:36] Dereckson: I mean not really, the new i18n message is only for errors [23:17:43] l10nbot will do a scap late rtonight anyway [23:17:51] ok [23:18:00] So theoretically it needs a scap but we can get away with not doing int [23:18:01] *it [23:18:02] Dereckson, yep re Flow wg = wmg. I will not be around, but RoanKattouw will be. [23:18:12] I'll be back in an hour or so. 
[23:18:25] okay, see you later [23:19:04] (03CR) 10Dereckson: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/314194 (owner: 10Dereckson) [23:19:09] (03PS3) 10Dereckson: Always set wgFlowDefaultWikiDb [mediawiki-config] - 10https://gerrit.wikimedia.org/r/314194 [23:19:26] (03CR) 10Dereckson: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/314194 (owner: 10Dereckson) [23:19:53] (03Merged) 10jenkins-bot: Always set wgFlowDefaultWikiDb [mediawiki-config] - 10https://gerrit.wikimedia.org/r/314194 (owner: 10Dereckson) [23:20:31] RoanKattouw: we're moving wgFlowDefaultWikiDb to be available everytime, so we can run maintenance script to create tables on wikitech at first, on private wikis who want to use Flow later [23:22:39] OK, makes sense [23:22:40] !log rebooting radon for kernel update (ns0.wikimedia.org) [23:22:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:23:11] hey, it tries to use frwiki.flow_workflow [23:23:21] Oh, I think I see the catch-22 in the previous system: when Flow is not enabled, $wgFlowDefaultWikiDb isn't set, so you can't run the script to create the Flow tables, so you have to enable Flow, but then everything explodes in a big ball of fire until you create the Flow tables [23:23:25] https://logstash.wikimedia.org/app/kibana#/doc/logstash-*/logstash-2016.10.05/mediawiki/?id=AVeXJ7wqJLU8TfYDvHYI [23:23:58] That's bad [23:24:02] Looking why [23:24:43] because require_once( "$IP/extensions/Flow/Flow.php" ); reset the configuration variable [23:24:49] Yes [23:24:59] I just realized that too [23:25:04] So we need to set it twice :/ [23:25:18] * RoanKattouw thanks God Greg for the mw1099 procedure [23:25:32] oh yes, that's a good case here [23:25:34] (we would've completely broken Flow on all wikis otherwise) [23:27:37] I've prepared a fix, sending it to Gerrit. 
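(editor's note) The load-order problem described above (the require_once of Flow.php resets the variable, so it has to be set twice) can be sketched as a toy model. This is Python, not the actual wmf-config PHP; the function names, the "example_flowdb" value, and the assumption that the extension's default is false are all hypothetical stand-ins for illustration.

```python
# Hypothetical placeholder for the shared Flow database name.
FLOW_DB = "example_flowdb"


def load_extension_defaults(config):
    """Stand-in for require_once of the extension entry point.

    Assumption: the entry point assigns its own default, overwriting
    whatever the site config set earlier.
    """
    config["wgFlowDefaultWikiDb"] = False


def build_config(set_before, set_after, flow_enabled):
    """Model the order of assignments in a site configuration file."""
    config = {}
    if set_before:
        # Set unconditionally, so maintenance scripts can see the value
        # even on wikis where Flow is not enabled yet.
        config["wgFlowDefaultWikiDb"] = FLOW_DB
    if flow_enabled:
        load_extension_defaults(config)  # this resets the variable
        if set_after:
            # Second assignment, after the extension defaults are loaded.
            config["wgFlowDefaultWikiDb"] = FLOW_DB
    return config


# Setting it only before loading the extension: the value is lost.
only_before = build_config(set_before=True, set_after=False, flow_enabled=True)

# Setting it twice: the value survives on Flow-enabled wikis.
set_twice = build_config(set_before=True, set_after=True, flow_enabled=True)

# Flow not enabled: the early assignment is what scripts would see.
flow_off = build_config(set_before=True, set_after=True, flow_enabled=False)
```

In this model only the set-twice ordering gives the intended value both with and without Flow enabled, which matches the reasoning in the channel for the follow-up patch.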
[23:29:35] Request from {my ip} via cp3009 cp3009, Varnish XID 1005845357 [23:29:35] Error: 503, Backend fetch failed at Wed, 05 Oct 2016 23:20:47 GMT [23:29:55] arseny92: where? [23:30:06] (03PS2) 10MaxSem: throttle: remove expired exceptions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/313166 [23:30:34] on jenkins while i was looking at the console outputs of your changes [23:30:56] here https://integration.wikimedia.org/ci/job/beta-scap-eqiad/123091/console [23:31:00] (03PS1) 10Dereckson: Set twice wgFlowDefaultWikiDb [mediawiki-config] - 10https://gerrit.wikimedia.org/r/314453 [23:31:12] (while it was running) [23:32:13] (03CR) 10Catrope: [C: 032] Set twice wgFlowDefaultWikiDb [mediawiki-config] - 10https://gerrit.wikimedia.org/r/314453 (owner: 10Dereckson) [23:32:40] (03Merged) 10jenkins-bot: Set twice wgFlowDefaultWikiDb [mediawiki-config] - 10https://gerrit.wikimedia.org/r/314453 (owner: 10Dereckson) [23:33:12] so Flow works again on mw1099 :) [23:33:47] arseny92: jenkins works on my side now [23:34:19] here too just reporting what happened [23:35:08] var_dump($wgFlowDefaultWikiDb); [23:35:08] bool(false) [23:35:21] works too as expected for private [23:35:25] so looks good to me [23:37:37] syncing to prod [23:38:19] RoanKattouw: no plan to use extension registration? That would have also avoided this issue. [23:38:22] !log dereckson@tin Synchronized wmf-config/CommonSettings.php: Always set wgFlowDefaultWikiDb ([[Gerrit:314194]] and [[Gerrit:314453]]) (duration: 00m 50s) [23:38:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:38:43] Flow still working everywhere [23:38:49] Dereckson: For Flow? 
Yeah that would be nice but IIRC legoktm said there were problems with using that at least for Echo, not sure about Flow [23:39:00] There's a patch to convert Flow to extension registration but it's old [23:39:23] 06Operations, 10ops-ulsfo: cr1-ulsfo broken serial cable (or port) - https://phabricator.wikimedia.org/T147430#2695526 (10RobH) a:03RobH [23:39:49] (03CR) 10Dereckson: "Follow-up: https://gerrit.wikimedia.org/r/#/c/314453/" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/314194 (owner: 10Dereckson) [23:46:00] I've got a power failure at home (and in the neighborhood). Sorry for the delay. [23:46:40] So 314452 [23:47:12] oh I thought you CR+2 the cherry-pick too, not only the master branch one [23:48:01] zuul isn't busy, that should be fast [23:48:27] Sorry, I forgot [23:48:35] I also had a wifi failure around the same time as your poewr failure [23:48:49] My laptop fell off the wifi and didn't get back on until I turned the wifi driver off and back on [23:48:50] :/ [23:49:18] ( https://www.youtube.com/watch?v=p85xwZ_OLX0 ) [23:49:48] always annoying this resolution method [23:50:31] (03PS3) 10Dereckson: Set Flow database for wikitech [mediawiki-config] - 10https://gerrit.wikimedia.org/r/314192 (https://phabricator.wikimedia.org/T127792) [23:51:20] (03CR) 10Dereckson: "PS3: rebased against 6e687396 and 82f2d1cf." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/314192 (https://phabricator.wikimedia.org/T127792) (owner: 10Dereckson) [23:52:11] (03CR) 10Dereckson: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/314192 (https://phabricator.wikimedia.org/T127792) (owner: 10Dereckson) [23:52:37] (03Merged) 10jenkins-bot: Set Flow database for wikitech [mediawiki-config] - 10https://gerrit.wikimedia.org/r/314192 (https://phabricator.wikimedia.org/T127792) (owner: 10Dereckson) [23:54:14] !log cache_misc: rolling depooled frontend restarts for libvmod-netmapper upgrade [23:54:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:55:30] works on mw1099 [23:55:33] before: flowdb, now: false [23:57:37] !log dereckson@tin Synchronized wmf-config/InitialiseSettings.php: Set Flow database for wikitech (T127792) (duration: 00m 50s) [23:57:38] T127792: Enable Flow on wikitech (labswiki and labtestwiki), then turn on for Tool talk namespace - https://phabricator.wikimedia.org/T127792 [23:57:42] So now, to create Flow tables for wikitech is possible. [23:57:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:57:45] thcipriani: did we resolve the issues with wmf.21 yet? [23:57:49] 314452 is merged [23:58:16] * aude looks at phabricator