[00:07:13] RECOVERY - puppet last run on sca2004 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures [00:08:41] (03PS9) 10Andrew Bogott: Keystone: Move api service to uwsgi/nginx [puppet] - 10https://gerrit.wikimedia.org/r/328400 (https://phabricator.wikimedia.org/T150774) [00:08:57] 06Operations, 06Analytics-Kanban, 06Reading-Web-Backlog, 10Traffic: mobile-safari has very few internally-referred pageviews - https://phabricator.wikimedia.org/T148780#2895170 (10Nuria) @JKatzWMF Besides documenting this fact as one on the dataset (super thanks for reporting!) I do not think there is any... [00:09:15] 06Operations, 10Analytics, 06Reading-Web-Backlog, 10Traffic: mobile-safari has very few internally-referred pageviews - https://phabricator.wikimedia.org/T148780#2895171 (10Nuria) a:05mforns>03None [00:09:54] bblack: yt? [00:11:42] 06Operations, 06Labs, 10Labs-Infrastructure, 07Wikimedia-Incident: labservices1001 down, suspected overheating - https://phabricator.wikimedia.org/T152340#2895203 (10Andrew) a:05Andrew>03None [00:12:14] bblack: Sending this ticket your way: https://phabricator.wikimedia.org/T148780 which has to do with changes to meta tags to mark internal referrals [00:18:22] 06Operations, 10Analytics, 10Traffic: A/B Testing solid framework - https://phabricator.wikimedia.org/T135762#2895208 (10Nuria) >Is that testing framework also planned to work with central notice/banners, or is that a separate infrastructure? couldn't say w/o knowing how central banner infrastructure works....
[00:19:35] (03PS1) 10Dzahn: ganglia: switch eqiad aggregator from carbon to install1001 [puppet] - 10https://gerrit.wikimedia.org/r/328599 (https://phabricator.wikimedia.org/T123733) [00:19:53] RECOVERY - puppet last run on db1024 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [00:20:09] try grep'ing for "carbon" to find the hostname [00:20:20] you find all the totally unrelated graphite "carbon" stuff [00:20:32] carbon-c-relay and so on [00:20:52] makes up new rule that hostnames should not also be service names [00:23:11] (03PS2) 10Dzahn: ganglia: switch eqiad aggregator from carbon to install1001 [puppet] - 10https://gerrit.wikimedia.org/r/328599 (https://phabricator.wikimedia.org/T123733) [00:26:45] 06Operations, 10Electron-PDFs, 10Security-Reviews, 06Services (blocked), and 2 others: Productize the Electron PDF render service & create a REST API end point - https://phabricator.wikimedia.org/T142226#2895217 (10GWicke) 05Open>03Resolved a:03GWicke The electron render service is deployed & exposed... [00:30:22] 06Operations, 13Patch-For-Review: Migrate carbon to jessie - https://phabricator.wikimedia.org/T123733#1936623 (10Dzahn) Is this actually still "migrate carbon" or is it now "decom carbon" since the mirror part was done on sodium meanwhile? (T84817 is that resolved too?)
[00:30:33] (03PS1) 10Dzahn: wmflib: replace carbon with install1001 in ipresolve tests [puppet] - 10https://gerrit.wikimedia.org/r/328600 (https://phabricator.wikimedia.org/T123733) [00:31:41] (03PS2) 10Dzahn: wmflib: replace carbon with install1001 in ipresolve tests [puppet] - 10https://gerrit.wikimedia.org/r/328600 (https://phabricator.wikimedia.org/T123733) [00:35:58] 06Operations, 13Patch-For-Review: Split carbon's install/mirror roles, provision install1001 - https://phabricator.wikimedia.org/T132757#2895254 (10Dzahn) [00:36:35] 06Operations: Setup install server in codfw - tftp done, but not apt and other install services - https://phabricator.wikimedia.org/T84380#2895257 (10Dzahn) a:03Dzahn [00:39:00] 06Operations: Setup install server in codfw - tftp done, but not apt and other install services - https://phabricator.wikimedia.org/T84380#2895261 (10Dzahn) meanwhile install2001 uses identical puppet roles as install1001 and carbon, and DHCP is running now (done as part of T132757) routers have been configured... [00:39:13] PROBLEM - puppet last run on cp4001 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [00:42:46] (03Abandoned) 10Dzahn: mariadb: split role classes into separate files [puppet] - 10https://gerrit.wikimedia.org/r/315343 (https://phabricator.wikimedia.org/T93645) (owner: 10Dzahn) [00:44:44] (03Abandoned) 10Dzahn: base/monitoring: add optional SMART disk check [puppet] - 10https://gerrit.wikimedia.org/r/304580 (owner: 10Dzahn) [00:45:59] (03Abandoned) 10Dzahn: add AAAA and PTR for cp1008 [dns] - 10https://gerrit.wikimedia.org/r/316036 (owner: 10Dzahn) [00:46:22] (03PS3) 10Dzahn: varnish misc: add phab2001 as a backend for phab-new [puppet] - 10https://gerrit.wikimedia.org/r/324797 (https://phabricator.wikimedia.org/T137928) [00:47:02] (03PS2) 10Dzahn: remove wikipedia.org.br [dns] - 10https://gerrit.wikimedia.org/r/327280 (https://phabricator.wikimedia.org/T137105) [00:47:36] (03CR) 10Dzahn: [C: 032] remove wikipedia.org.br [dns] - 10https://gerrit.wikimedia.org/r/327280 (https://phabricator.wikimedia.org/T137105) (owner: 10Dzahn) [00:48:49] (03CR) 10Dzahn: "wikipedia.org.br has address 162.243.245.212" [dns] - 10https://gerrit.wikimedia.org/r/327280 (https://phabricator.wikimedia.org/T137105) (owner: 10Dzahn) [00:50:05] (03CR) 10Dzahn: "alright, gotcha, wouldn't a merge still be easy and not incorrect though" [puppet] - 10https://gerrit.wikimedia.org/r/327426 (owner: 10Dzahn) [00:50:19] (03Abandoned) 10Dzahn: hhvm: add missing " in hhvm.default.systemd.erb [puppet] - 10https://gerrit.wikimedia.org/r/327426 (owner: 10Dzahn) [00:51:13] (03PS3) 10Dzahn: icinga/CI: give all shell scripts a file extension [puppet] - 10https://gerrit.wikimedia.org/r/327592 (https://phabricator.wikimedia.org/T148494) [00:51:57] (03CR) 10jerkins-bot: [V: 04-1] icinga/CI: give all shell scripts a file extension [puppet] - 10https://gerrit.wikimedia.org/r/327592 (https://phabricator.wikimedia.org/T148494) (owner: 10Dzahn) [00:54:53] PROBLEM - puppet last run on sca1003 is CRITICAL: CRITICAL: Puppet has 27 failures. 
Last run 2 minutes ago with 27 failures. Failed resources (up to 3 shown): Exec[eth0_v6_token],Package[wipe],Package[zotero/translators],Package[zotero/translation-server] [01:02:47] 06Operations, 05Prometheus-metrics-monitoring: Improvements to Ganglia-equivalent Prometheus dashboards - https://phabricator.wikimedia.org/T152791#2895337 (10fgiunchedi) [01:03:37] (03PS1) 10Dzahn: remove IDNs that are not registered by us anymore [dns] - 10https://gerrit.wikimedia.org/r/328604 (https://phabricator.wikimedia.org/T137105) [01:07:13] RECOVERY - puppet last run on cp4001 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures [01:07:58] (03PS2) 10Dzahn: toollabs/CI: give banner scripts an .sh extension [puppet] - 10https://gerrit.wikimedia.org/r/327673 (https://phabricator.wikimedia.org/T148494) [01:10:01] (03CR) 10Dzahn: [C: 032] remove absented file long gone [puppet] - 10https://gerrit.wikimedia.org/r/328596 (owner: 10Matanya) [01:11:30] (03CR) 10Dzahn: ":( that is very true, can't imagine who else would do this" [puppet] - 10https://gerrit.wikimedia.org/r/287663 (owner: 10Yuvipanda) [01:13:49] (03CR) 10Dzahn: "this might have been doing too many things at once (split up classes but ALSO do all kinds of other cleanup), how about doing _just_ the s" [puppet] - 10https://gerrit.wikimedia.org/r/287663 (owner: 10Yuvipanda) [01:20:55] 06Operations, 06Discovery, 10Wikidata, 10Wikidata-Query-Service, and 2 others: Install and configure new WDQS nodes on codfw - https://phabricator.wikimedia.org/T144380#2597699 (10Smalyshev) @Gehel I imagine this is all done? 
[01:22:53] RECOVERY - puppet last run on sca1003 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures [01:34:53] PROBLEM - Postgres Replication Lag on maps1002 is CRITICAL: CRITICAL - Rep Delay is: 1806.925358 Seconds [01:34:53] PROBLEM - Postgres Replication Lag on maps1004 is CRITICAL: CRITICAL - Rep Delay is: 1809.86724 Seconds [01:35:53] RECOVERY - Postgres Replication Lag on maps1002 is OK: OK - Rep Delay is: 31.171255 Seconds [01:35:53] RECOVERY - Postgres Replication Lag on maps1004 is OK: OK - Rep Delay is: 34.220621 Seconds [01:50:53] (03PS1) 10Alex Monk: graphite: Don't use wikitech API to find labs projects/instances [puppet] - 10https://gerrit.wikimedia.org/r/328608 (https://phabricator.wikimedia.org/T104575) [02:09:26] (03PS1) 10Alex Monk: labstore: Don't use wikitech API to find labs instances in nfs-exportd [puppet] - 10https://gerrit.wikimedia.org/r/328609 (https://phabricator.wikimedia.org/T104575) [02:21:34] !log l10nupdate@tin scap sync-l10n completed (1.29.0-wmf.6) (duration: 07m 54s) [02:21:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:26:24] !log l10nupdate@tin ResourceLoader cache refresh completed at Thu Dec 22 02:26:23 UTC 2016 (duration 4m 49s) [02:26:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:28:42] (03PS1) 10Alex Monk: Move shinkengen from using LDAP to the OpenStack APIs [puppet] - 10https://gerrit.wikimedia.org/r/328611 [02:34:53] PROBLEM - puppet last run on sca1004 is CRITICAL: CRITICAL: Puppet has 27 failures. Last run 2 minutes ago with 27 failures. Failed resources (up to 3 shown): Exec[eth0_v6_token],Package[wipe],Package[zotero/translators],Package[zotero/translation-server] [02:50:53] PROBLEM - puppet last run on db1026 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [03:01:43] RECOVERY - puppet last run on sca1004 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures [03:18:46] 06Operations, 06Discovery, 10Traffic, 10Wikidata, and 2 others: Consider switching to HTTPS for Wikidata query service links - https://phabricator.wikimedia.org/T153563#2895420 (10Esc3300) T153897 seems to illustrate the problem with not using http. [03:18:53] RECOVERY - puppet last run on db1026 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures [03:25:23] PROBLEM - puppet last run on cp3039 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [03:28:46] 06Operations, 06Commons, 10TimedMediaHandler, 10Wikimedia-Video: Creation of derivative video files is not working - https://phabricator.wikimedia.org/T153852#2895427 (10Pokefan95) >>! In T153852#2893196, @Revent wrote: > It's simply that the backlog became extremely large due to a high number of 'huge' (1... [03:31:15] 06Operations, 06Commons, 10TimedMediaHandler-Transcode, 10Wikimedia-Video, and 2 others: Commons video transcoders have over 6500 tasks in the backlog. - https://phabricator.wikimedia.org/T153488#2882187 (10Pokefan95) [03:40:13] 06Operations, 06Commons, 10TimedMediaHandler, 10Wikimedia-Video: Creation of derivative video files is not working - https://phabricator.wikimedia.org/T153852#2895439 (10Revent) @Pokefan95 When I said 1920P, https://commons.wikimedia.org/wiki/File:Moscow_Ring_Railway_full_trip_-_view_from_ES2G_train.webm i... [03:45:43] PROBLEM - puppet last run on fluorine is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [03:54:23] RECOVERY - puppet last run on cp3039 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [04:02:26] (03PS1) 10Aaron Schulz: Include DB shard as a logstash column [mediawiki-config] - 10https://gerrit.wikimedia.org/r/328618 [04:08:43] PROBLEM - puppet last run on db1087 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [04:13:43] RECOVERY - puppet last run on fluorine is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures [04:15:33] PROBLEM - mailman I/O stats on fermium is CRITICAL: CRITICAL - I/O stats: Transfers/Sec=3095.80 Read Requests/Sec=3799.30 Write Requests/Sec=9.70 KBytes Read/Sec=15201.60 KBytes_Written/Sec=224.40 [04:17:33] PROBLEM - mailman I/O stats on fermium is CRITICAL: CRITICAL - I/O stats: Transfers/Sec=3100.10 Read Requests/Sec=1478.00 Write Requests/Sec=15.60 KBytes Read/Sec=28882.40 KBytes_Written/Sec=84.40 [04:28:33] RECOVERY - mailman I/O stats on fermium is OK: OK - I/O stats: Transfers/Sec=0.40 Read Requests/Sec=168.70 Write Requests/Sec=0.80 KBytes Read/Sec=1606.00 KBytes_Written/Sec=261.60 [04:31:53] PROBLEM - puppet last run on analytics1034 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [04:36:43] RECOVERY - puppet last run on db1087 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [05:00:53] RECOVERY - puppet last run on analytics1034 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [05:02:03] PROBLEM - puppet last run on sca2003 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [05:11:32] 06Operations, 06Commons, 10TimedMediaHandler, 10Wikimedia-Video: Creation of derivative video files is not working - https://phabricator.wikimedia.org/T153852#2895475 (10zhuyifei1999) (Just a note about the terminology: 1920 x 1080 px is usually referred as [[https://en.wikipedia.org/wiki/1080p|1080p]], as... [05:30:03] RECOVERY - puppet last run on sca2003 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [05:38:19] (03CR) 10Smalyshev: "We may want to deploy this before we deploy next WDQS, so that it will set correct vars." [puppet] - 10https://gerrit.wikimedia.org/r/328582 (https://phabricator.wikimedia.org/T153897) (owner: 10Smalyshev) [05:39:10] (03PS2) 10Smalyshev: Add configuration for query endpoint URL [puppet] - 10https://gerrit.wikimedia.org/r/328582 (https://phabricator.wikimedia.org/T153897) [05:45:53] 06Operations, 06Discovery, 10Traffic, 10Wikidata, and 2 others: Consider switching to HTTPS for Wikidata query service links - https://phabricator.wikimedia.org/T153563#2895494 (10Smalyshev) @Esc3300 no, that's completely different problem, actually a configuration bug having nothing to do with this one :)... [05:59:07] 06Operations, 06Discovery, 10Traffic, 10Wikidata, and 2 others: Consider switching to HTTPS for Wikidata query service links - https://phabricator.wikimedia.org/T153563#2895515 (10Esc3300) Good point. So that all links on http://ldfclient.wmflabs.org/ point to http:// is a different problem. [06:03:28] 06Operations, 10Traffic, 10Wikimedia-Shop, 07HTTPS: store.wikimedia.org HTTPS issues - https://phabricator.wikimedia.org/T128559#2078914 (10Jseddon) Hey @BBlack, Apologies, this got dropped. Either myself or @MBeat33 will get back to you as to whether we can actually make this change or not, although my h... [06:27:53] PROBLEM - puppet last run on einsteinium is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [06:40:53] PROBLEM - puppet last run on elastic1032 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[httpry] [06:55:03] PROBLEM - puppet last run on restbase1012 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [06:56:53] RECOVERY - puppet last run on einsteinium is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures [07:08:53] RECOVERY - puppet last run on elastic1032 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures [07:11:53] PROBLEM - puppet last run on elastic1026 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [07:24:03] RECOVERY - puppet last run on restbase1012 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:26:49] 06Operations, 15User-Elukey: Tracking and Reducing cron-spam from root@ - https://phabricator.wikimedia.org/T132324#2895574 (10elukey) [07:26:54] !log created /var/log/squid3/access.log.1.gz on aluminum to fix cronspam - T132324 [07:26:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:26:57] T132324: Tracking and Reducing cron-spam from root@ - https://phabricator.wikimedia.org/T132324 [07:34:28] 06Operations: Logrotate fails for: "$FILE No such file or directory" - https://phabricator.wikimedia.org/T153940#2895579 (10elukey) [07:36:31] (03CR) 10Muehlenhoff: [C: 032] Update to 4.4.36 [debs/linux44] - 10https://gerrit.wikimedia.org/r/328546 (owner: 10Muehlenhoff) [07:40:53] RECOVERY - puppet last run on elastic1026 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures [07:48:26] (03CR) 10Alexandros Kosiaris: [C: 032] wmflib: replace carbon with install1001 in ipresolve tests [puppet] - 10https://gerrit.wikimedia.org/r/328600 (https://phabricator.wikimedia.org/T123733) (owner: 
10Dzahn) [07:51:45] (03PS3) 10Alexandros Kosiaris: wmflib: replace carbon with install1001 in ipresolve tests [puppet] - 10https://gerrit.wikimedia.org/r/328600 (https://phabricator.wikimedia.org/T123733) (owner: 10Dzahn) [07:51:49] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] wmflib: replace carbon with install1001 in ipresolve tests [puppet] - 10https://gerrit.wikimedia.org/r/328600 (https://phabricator.wikimedia.org/T123733) (owner: 10Dzahn) [07:53:48] (03PS1) 10Muehlenhoff: Update to 4.4.37 [debs/linux44] - 10https://gerrit.wikimedia.org/r/328633 [07:53:54] (03CR) 10Alexandros Kosiaris: [C: 04-1] "requires that install1001 has the ganglia::monitor::aggregator class first" [puppet] - 10https://gerrit.wikimedia.org/r/328599 (https://phabricator.wikimedia.org/T123733) (owner: 10Dzahn) [07:55:13] PROBLEM - puppet last run on ms-be1025 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [07:55:48] 06Operations, 13Patch-For-Review: Split carbon's install/mirror roles, provision install1001 - https://phabricator.wikimedia.org/T132757#2895610 (10akosiaris) >>! In T132757#2894515, @Dzahn wrote: > - @akosiaris configured switches so that public1-b-eqiad and public1-c-eqiad are using install1001 as DHCP Corr... [07:57:53] PROBLEM - puppet last run on mc1017 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [07:58:21] (03PS9) 10Muehlenhoff: Create a separate sysctl configuration for setting conntrack settings [puppet] - 10https://gerrit.wikimedia.org/r/319071 [07:59:49] (03PS7) 10Muehlenhoff: Load connection tracking sysctl values via a separate systemd unit [puppet] - 10https://gerrit.wikimedia.org/r/320197 (https://phabricator.wikimedia.org/T136094) [08:08:21] (03CR) 10Alexandros Kosiaris: [C: 032] Introduce dbmonitor1001, dbmonitor2001 [puppet] - 10https://gerrit.wikimedia.org/r/328509 (https://phabricator.wikimedia.org/T149557) (owner: 10Alexandros Kosiaris) [08:08:28] (03PS2) 10Alexandros Kosiaris: Introduce dbmonitor1001, dbmonitor2001 [puppet] - 10https://gerrit.wikimedia.org/r/328509 (https://phabricator.wikimedia.org/T149557) [08:09:08] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Introduce dbmonitor1001, dbmonitor2001 [puppet] - 10https://gerrit.wikimedia.org/r/328509 (https://phabricator.wikimedia.org/T149557) (owner: 10Alexandros Kosiaris) [08:15:04] (03CR) 10Muehlenhoff: [C: 032] Update to 4.4.37 [debs/linux44] - 10https://gerrit.wikimedia.org/r/328633 (owner: 10Muehlenhoff) [08:18:23] !log installing Django security updates [08:18:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:19:53] PROBLEM - puppet last run on mw1179 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Package[imagemagick] [08:21:53] RECOVERY - puppet last run on mw1179 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures [08:24:13] RECOVERY - puppet last run on ms-be1025 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures [08:25:32] (03PS1) 10Muehlenhoff: Update to 4.4.38 [debs/linux44] - 10https://gerrit.wikimedia.org/r/328634 [08:25:53] RECOVERY - puppet last run on mc1017 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures [08:40:23] PROBLEM - puppet last run on puppetmaster2002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [08:40:48] (03CR) 10Muehlenhoff: [C: 032] Update to 4.4.38 [debs/linux44] - 10https://gerrit.wikimedia.org/r/328634 (owner: 10Muehlenhoff) [08:43:14] (03CR) 10Jcrespo: "This is ok, but it will need a followup for db1069, or it will fail every time." (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/328352 (https://phabricator.wikimedia.org/T153680) (owner: 10Marostegui) [08:45:53] !log installing libav security updates on trusty systems [08:45:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:56:40] 06Operations, 06Commons, 10TimedMediaHandler-Transcode, 10Wikimedia-Video, and 2 others: Commons video transcoders have over 6500 tasks in the backlog. 
- https://phabricator.wikimedia.org/T153488#2895659 (10elukey) [09:02:54] !log installing tomcat security updates [09:02:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:08:23] RECOVERY - puppet last run on puppetmaster2002 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures [09:20:43] (03PS5) 10Marostegui: [WIP] Reporting tests with the private data script [puppet] - 10https://gerrit.wikimedia.org/r/328352 (https://phabricator.wikimedia.org/T153680) [09:21:09] (03PS1) 10Alexandros Kosiaris: Add k8s_infrastructure_users [labs/private] - 10https://gerrit.wikimedia.org/r/328642 [09:23:12] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Add k8s_infrastructure_users [labs/private] - 10https://gerrit.wikimedia.org/r/328642 (owner: 10Alexandros Kosiaris) [09:24:55] (03CR) 10Marostegui: [WIP] Reporting tests with the private data script (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/328352 (https://phabricator.wikimedia.org/T153680) (owner: 10Marostegui) [09:26:04] 06Operations, 10TimedMediaHandler, 10hardware-requests: Assign 3 more servers to video scaler duty - https://phabricator.wikimedia.org/T114337#2895698 (10elukey) [09:26:06] 06Operations, 10Wikimedia-General-or-Unknown, 10hardware-requests: Extend capacity for video scalers - https://phabricator.wikimedia.org/T150067#2895700 (10elukey) [09:26:46] 06Operations, 06Commons, 10TimedMediaHandler-Transcode, 10Wikimedia-Video, and 2 others: Commons video transcoders have over 6500 tasks in the backlog. - https://phabricator.wikimedia.org/T153488#2895701 (10elukey) p:05High>03Normal [09:31:45] 06Operations, 06Commons, 10TimedMediaHandler-Transcode, 10Wikimedia-Video, and 2 others: Commons video transcoders have over 6500 tasks in the backlog. - https://phabricator.wikimedia.org/T153488#2895709 (10elukey) The queue, as far as I can see, has been processing tasks during the past day without hittin... 
[09:34:53] PROBLEM - puppet last run on sca1004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:34:59] (03CR) 10Jcrespo: [WIP] Reporting tests with the private data script (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/328352 (https://phabricator.wikimedia.org/T153680) (owner: 10Marostegui) [09:35:08] (03PS6) 10Alexandros Kosiaris: kubernetes::master: Introduce the kubernetes profile [puppet] - 10https://gerrit.wikimedia.org/r/328174 [09:35:10] (03PS6) 10Alexandros Kosiaris: Create and assign the kubernetes::master role [puppet] - 10https://gerrit.wikimedia.org/r/328175 [09:35:13] (03PS19) 10Alexandros Kosiaris: Add profile::kubernetes::node profile class [puppet] - 10https://gerrit.wikimedia.org/r/324212 [09:35:17] (03PS18) 10Alexandros Kosiaris: Include ::profile::kubernetes::node in role::kubernetes::worker [puppet] - 10https://gerrit.wikimedia.org/r/324213 [09:36:09] (03CR) 10Marostegui: [WIP] Reporting tests with the private data script (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/328352 (https://phabricator.wikimedia.org/T153680) (owner: 10Marostegui) [09:45:06] (03PS1) 10Nemo bis: [Planet Wikimedia] Update .mau. feed URL on Italian planet [puppet] - 10https://gerrit.wikimedia.org/r/328645 [09:52:23] 06Operations, 06Discovery, 10Wikidata, 10Wikidata-Query-Service, 07Wikimedia-Incident: Deploy WDQS nodes on codfw - https://phabricator.wikimedia.org/T124862#2895722 (10Gehel) [09:52:27] 06Operations, 06Discovery, 10Wikidata, 10Wikidata-Query-Service, and 2 others: Install and configure new WDQS nodes on codfw - https://phabricator.wikimedia.org/T144380#2895717 (10Gehel) 05Open>03Resolved Oops... yes, it has been done for some time... We now have 2 new servers (T152643 and T152644) bu... 
[09:52:51] 06Operations, 06Discovery, 10Wikidata, 10Wikidata-Query-Service, 07Wikimedia-Incident: Deploy WDQS nodes on codfw - https://phabricator.wikimedia.org/T124862#1968603 (10Gehel) 05Open>03Resolved a:03Gehel [09:52:53] 06Operations, 10ops-codfw, 06Discovery, 10Wikidata, and 2 others: Adjust balance of WDQS nodes to allow continued operation if eqiad went offline. - https://phabricator.wikimedia.org/T124627#2895726 (10Gehel) [09:52:59] (03CR) 10Ema: tlsproxy::localssl: add ability to have an access.log (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/328495 (https://phabricator.wikimedia.org/T153797) (owner: 10Giuseppe Lavagetto) [09:53:36] 06Operations, 10ops-codfw, 06Discovery, 10Wikidata, 10Wikidata-Query-Service: rack/setup/install wdqs2003 - https://phabricator.wikimedia.org/T152644#2855908 (10Gehel) [09:53:40] 06Operations, 10ops-eqiad, 06Discovery, 10Wikidata, 10Wikidata-Query-Service: rack/setup/install wdqs1003 - https://phabricator.wikimedia.org/T152643#2855891 (10Gehel) [09:53:42] 06Operations, 10ops-codfw, 06Discovery, 10Wikidata, and 2 others: Adjust balance of WDQS nodes to allow continued operation if eqiad went offline. 
- https://phabricator.wikimedia.org/T124627#1960976 (10Gehel) [09:59:41] PROBLEM - HTTPS-tendril on dbmonitor1001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [10:02:03] !log installing c-ares security updates [10:02:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:02:21] RECOVERY - puppet last run on sca1004 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures [10:02:32] !log installing c-ares security updates on trusty systems (jessie already fixed for quite a while) [10:02:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:23:21] (03CR) 10Marostegui: [WIP] Reporting tests with the private data script (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/328352 (https://phabricator.wikimedia.org/T153680) (owner: 10Marostegui) [10:27:20] (03PS4) 10Gehel: New upstream version: 1.11.0 [debs/logstash-gelf] - 10https://gerrit.wikimedia.org/r/320992 (https://phabricator.wikimedia.org/T150408) [10:30:57] 06Operations, 06Discovery, 10Elasticsearch, 06Discovery-Search (Current work), 13Patch-For-Review: Upgrade our logstash-gelf package to latest available upstream version - https://phabricator.wikimedia.org/T150408#2895742 (10Gehel) The [[ https://gerrit.wikimedia.org/r/#/c/320992/ | updated liblogstash-g... 
[10:31:19] PROBLEM - HTTPS-tendril on dbmonitor2001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [10:40:27] (03PS1) 10Muehlenhoff: Update to 4.4.39 [debs/linux44] - 10https://gerrit.wikimedia.org/r/328649 [10:41:02] (03PS6) 10Marostegui: Reporting tests with the private data script [puppet] - 10https://gerrit.wikimedia.org/r/328352 (https://phabricator.wikimedia.org/T153680) [10:43:12] (03CR) 10Muehlenhoff: [C: 032] Update to 4.4.39 [debs/linux44] - 10https://gerrit.wikimedia.org/r/328649 (owner: 10Muehlenhoff) [10:52:20] (03PS1) 10Muehlenhoff: Update date in changelog for build [debs/linux44] - 10https://gerrit.wikimedia.org/r/328650 [10:52:24] (03PS7) 10Marostegui: Reporting tests with the private data script [puppet] - 10https://gerrit.wikimedia.org/r/328352 (https://phabricator.wikimedia.org/T153680) [10:58:16] (03PS2) 10Alexandros Kosiaris: k8s::apiserver: Allow specifying the SSL file paths [puppet] - 10https://gerrit.wikimedia.org/r/328553 [10:58:37] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] "tested in toollabs, looks fine, yuvi gave an ok on IRC, merging" [puppet] - 10https://gerrit.wikimedia.org/r/328553 (owner: 10Alexandros Kosiaris) [10:59:00] (03CR) 10Muehlenhoff: [C: 032] Update date in changelog for build [debs/linux44] - 10https://gerrit.wikimedia.org/r/328650 (owner: 10Muehlenhoff) [10:59:38] (03PS7) 10Alexandros Kosiaris: kubernetes::master: Introduce the kubernetes profile [puppet] - 10https://gerrit.wikimedia.org/r/328174 [11:00:13] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] "PCC happy at https://puppet-compiler.wmflabs.org/4973/argon.eqiad.wmnet/, merging" [puppet] - 10https://gerrit.wikimedia.org/r/328174 (owner: 10Alexandros Kosiaris) [11:00:31] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] "PCC happy at https://puppet-compiler.wmflabs.org/4973/argon.eqiad.wmnet/, merging" [puppet] - 10https://gerrit.wikimedia.org/r/328175 (owner: 10Alexandros Kosiaris) [11:00:43] (03PS7) 10Alexandros Kosiaris: 
Create and assign the kubernetes::master role [puppet] - 10https://gerrit.wikimedia.org/r/328175 [11:00:47] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Create and assign the kubernetes::master role [puppet] - 10https://gerrit.wikimedia.org/r/328175 (owner: 10Alexandros Kosiaris) [11:04:09] PROBLEM - Check systemd state on chlorine is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [11:04:10] PROBLEM - puppet last run on chlorine is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 1 minute ago with 1 failures. Failed resources (up to 3 shown): Service[kube-apiserver] [11:04:19] PROBLEM - puppet last run on argon is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 1 minute ago with 1 failures. Failed resources (up to 3 shown): Service[kube-apiserver] [11:04:39] PROBLEM - Check systemd state on argon is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [11:06:29] PROBLEM - Check systemd state on dbmonitor1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[11:06:41] all of these expected ^ [11:10:29] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 10.192.48.44 on port 6479 [11:11:29] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 3921448 keys, up 52 days 2 hours - replication_delay is 45 [11:14:39] RECOVERY - HTTPS-tendril on dbmonitor1001 is OK: SSL OK - Certificate tendril.wikimedia.org valid until 2017-03-17 11:00:15 +0000 (expires in 84 days) [11:15:29] RECOVERY - Check systemd state on dbmonitor1001 is OK: OK - running: The system is fully operational [11:17:30] (03PS1) 10Alexandros Kosiaris: tendril: Add all the required apache modules [puppet] - 10https://gerrit.wikimedia.org/r/328653 [11:20:23] (03PS2) 10Alexandros Kosiaris: tendril: Add all the required apache modules [puppet] - 10https://gerrit.wikimedia.org/r/328653 [11:25:30] 06Operations, 06Commons, 10TimedMediaHandler-Transcode, 10Wikimedia-Video, and 2 others: Commons video transcoders have over 6500 tasks in the backlog. - https://phabricator.wikimedia.org/T153488#2895794 (10Revent) Jobs seem to be processing without problems... I have seen tasks completing successfully in... 
[11:28:27] (03PS1) 10Muehlenhoff: yarn web ui: Restrict to analytics networks [puppet] - 10https://gerrit.wikimedia.org/r/328654 [11:32:29] (03CR) 10Alexandros Kosiaris: [C: 032] "PCC happy at https://puppet-compiler.wmflabs.org/4974/, merging" [puppet] - 10https://gerrit.wikimedia.org/r/328653 (owner: 10Alexandros Kosiaris) [11:34:20] RECOVERY - HTTPS-tendril on dbmonitor2001 is OK: SSL OK - Certificate tendril.wikimedia.org valid until 2017-03-17 11:00:15 +0000 (expires in 84 days) [11:46:53] (03PS1) 10Alexandros Kosiaris: Add dbmonitor1001, dbmonitor2001 to network::constants [puppet] - 10https://gerrit.wikimedia.org/r/328656 [11:49:13] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Add dbmonitor1001, dbmonitor2001 to network::constants [puppet] - 10https://gerrit.wikimedia.org/r/328656 (owner: 10Alexandros Kosiaris) [11:51:12] hmm this ^ triggers a cluster wide ntp reload... [11:51:39] I am gonna be monitoring it in case anything goes wrong. I don't expect much though [11:54:10] PROBLEM - NTP peers on maerlant is CRITICAL: NTP CRITICAL: Server not synchronized, Offset unknown [11:56:10] RECOVERY - NTP peers on maerlant is OK: NTP OK: Offset -0.000962 secs [11:56:50] PROBLEM - NTP peers on chromium is CRITICAL: NTP CRITICAL: Server not synchronized, Offset unknown [11:58:10] PROBLEM - puppet last run on db1015 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [11:58:50] RECOVERY - NTP peers on chromium is OK: NTP OK: Offset -0.006757 secs [11:59:41] (03PS2) 10Elukey: yarn web ui: Restrict to analytics networks [puppet] - 10https://gerrit.wikimedia.org/r/328654 (owner: 10Muehlenhoff) [12:00:03] moritzm: if you are ok I can merge --^ [12:00:10] PROBLEM - NTP peers on hydrogen is CRITICAL: NTP CRITICAL: Server not synchronized, Offset unknown [12:01:05] (03CR) 10Elukey: [C: 032] yarn web ui: Restrict to analytics networks [puppet] - 10https://gerrit.wikimedia.org/r/328654 (owner: 10Muehlenhoff) [12:01:10] RECOVERY - NTP peers on hydrogen is OK: NTP OK: Offset 0.001758 secs [12:02:09] well it seems super safe, I'll go ahead :) [12:03:00] PROBLEM - NTP peers on nescio is CRITICAL: NTP CRITICAL: Server not synchronized, Offset unknown [12:03:12] all these ^ are expected [12:04:30] PROBLEM - puppet last run on graphite1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:05:00] RECOVERY - NTP peers on nescio is OK: NTP OK: Offset 0.000413 secs [12:05:09] (03PS1) 10Alexandros Kosiaris: profile::kubernetes: Specify use_package parameter [puppet] - 10https://gerrit.wikimedia.org/r/328658 [12:05:28] moritzm: just ran puppet on an1001, all good [12:08:15] 06Operations, 10vm-requests, 13Patch-For-Review: Site: 2 VM request for tendril - https://phabricator.wikimedia.org/T149557#2895866 (10akosiaris) [12:09:00] 06Operations, 10vm-requests, 13Patch-For-Review: Site: 2 VM request for tendril - https://phabricator.wikimedia.org/T149557#2756016 (10akosiaris) 05Open>03Resolved a:03akosiaris VMs are up and running, tendril runs on them (with LDAP auth on, over HTTPS), resolving this. [12:11:00] elukey: sure, please go ahead :-) [12:12:03] 06Operations, 06Commons, 10TimedMediaHandler-Transcode, 10Wikimedia-Video, and 2 others: Commons video transcoders have over 6500 tasks in the backlog. 
- https://phabricator.wikimedia.org/T153488#2895874 (10zhuyifei1999) (that was #video2commons) [12:12:25] (03PS2) 10Alexandros Kosiaris: profile::kubernetes: Specify use_package parameter [puppet] - 10https://gerrit.wikimedia.org/r/328658 [12:13:00] PROBLEM - NTP peers on acamar is CRITICAL: NTP CRITICAL: Server not synchronized, Offset unknown [12:14:00] RECOVERY - NTP peers on acamar is OK: NTP OK: Offset -0.00072 secs [12:14:25] (03CR) 10Alexandros Kosiaris: [C: 032] profile::kubernetes: Specify use_package parameter [puppet] - 10https://gerrit.wikimedia.org/r/328658 (owner: 10Alexandros Kosiaris) [12:17:00] PROBLEM - NTP peers on achernar is CRITICAL: NTP CRITICAL: Server not synchronized, Offset unknown [12:18:00] RECOVERY - NTP peers on achernar is OK: NTP OK: Offset 0.00012 secs [12:25:10] RECOVERY - puppet last run on db1015 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures [12:31:24] (03PS1) 10Mobrovac: RESTBase-Cassandra: Add the topk reporter [puppet] - 10https://gerrit.wikimedia.org/r/328660 (https://phabricator.wikimedia.org/T147366) [12:32:19] (03CR) 10jerkins-bot: [V: 04-1] RESTBase-Cassandra: Add the topk reporter [puppet] - 10https://gerrit.wikimedia.org/r/328660 (https://phabricator.wikimedia.org/T147366) (owner: 10Mobrovac) [12:32:30] RECOVERY - puppet last run on graphite1001 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [12:33:40] PROBLEM - puppet last run on rcs1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:33:45] 06Operations, 10Wikimedia-Extension-setup, 07Tamil-Sites: Enable Extension:ShortUrl on or.wikipedia, ta.wikipedia... 
- https://phabricator.wikimedia.org/T3450#2895952 (10MarcoAurelio) [12:34:50] (03PS1) 10Alexandros Kosiaris: k8s::apiserver: Remove redundant --tls-cert-file [puppet] - 10https://gerrit.wikimedia.org/r/328661 [12:35:04] 06Operations, 10Wikimedia-Extension-setup, 07Tamil-Sites: Enable Extension:ShortUrl on or.wikipedia, ta.wikipedia... - https://phabricator.wikimedia.org/T3450#2896016 (10MarcoAurelio) [12:35:31] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] k8s::apiserver: Remove redundant --tls-cert-file [puppet] - 10https://gerrit.wikimedia.org/r/328661 (owner: 10Alexandros Kosiaris) [12:44:40] PROBLEM - Host puppetmaster2001 is DOWN: PING CRITICAL - Packet loss = 100% [12:45:30] RECOVERY - Host puppetmaster2001 is UP: PING OK - Packet loss = 0%, RTA = 36.00 ms [12:46:30] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 10.192.48.44 on port 6479 [12:47:20] PROBLEM - puppet last run on cp2026 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:47:20] PROBLEM - puppet last run on ms-be2004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:47:20] PROBLEM - puppet last run on mw2247 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:47:30] PROBLEM - puppet last run on mw2244 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:47:30] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 3918836 keys, up 52 days 4 hours - replication_delay is 0 [12:47:30] PROBLEM - puppet last run on cp4002 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [12:47:40] (03PS2) 10Mobrovac: RESTBase-Cassandra: Add the topk reporter [puppet] - 10https://gerrit.wikimedia.org/r/328660 (https://phabricator.wikimedia.org/T147366) [12:48:35] (03CR) 10jerkins-bot: [V: 04-1] RESTBase-Cassandra: Add the topk reporter [puppet] - 10https://gerrit.wikimedia.org/r/328660 (https://phabricator.wikimedia.org/T147366) (owner: 10Mobrovac) [12:48:40] PROBLEM - puppet last run on ms-be2006 is CRITICAL: CRITICAL: Puppet has 7 failures. Last run 2 minutes ago with 7 failures. Failed resources (up to 3 shown): File[/usr/lib/nagios/plugins/check-fresh-files-in-dir.py],File[/usr/local/bin/puppet-enabled],File[/usr/lib/nagios/plugins/check_sysctl],File[/etc/sysctl.d] [12:57:05] (03PS3) 10Mobrovac: RESTBase-Cassandra: Add the topk reporter [puppet] - 10https://gerrit.wikimedia.org/r/328660 (https://phabricator.wikimedia.org/T147366) [12:58:58] (03CR) 10jerkins-bot: [V: 04-1] RESTBase-Cassandra: Add the topk reporter [puppet] - 10https://gerrit.wikimedia.org/r/328660 (https://phabricator.wikimedia.org/T147366) (owner: 10Mobrovac) [13:01:20] PROBLEM - puppet last run on mc2016 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:01:20] PROBLEM - puppet last run on restbase-test2002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:01:40] RECOVERY - puppet last run on rcs1002 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures [13:02:00] PROBLEM - puppet last run on db2045 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:13:00] PROBLEM - puppet last run on db1083 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [13:13:20] RECOVERY - puppet last run on restbase-test2002 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures [13:14:20] RECOVERY - puppet last run on cp2026 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [13:14:20] RECOVERY - puppet last run on mc2016 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures [13:14:20] RECOVERY - puppet last run on mw2247 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures [13:14:30] RECOVERY - puppet last run on mw2244 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures [13:14:30] RECOVERY - puppet last run on cp4002 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [13:14:40] RECOVERY - puppet last run on ms-be2006 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures [13:15:00] RECOVERY - puppet last run on db2045 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures [13:15:20] RECOVERY - puppet last run on ms-be2004 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures [13:16:05] 06Operations, 06Commons, 10TimedMediaHandler-Transcode, 10Wikimedia-Video, and 2 others: Commons video transcoders have over 6500 tasks in the backlog. 
- https://phabricator.wikimedia.org/T153488#2896071 (10Urbanecm) [13:34:28] (03PS1) 10Muehlenhoff: hive/metastore: Restrict to analytics networks [puppet] - 10https://gerrit.wikimedia.org/r/328664 [13:40:01] RECOVERY - puppet last run on db1083 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures [13:41:22] !log stopping db1035 (depooled) replication to perform maintenance to avoid disk alerts in the next 2 weeks [13:41:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:52:40] RECOVERY - Check systemd state on argon is OK: OK - running: The system is fully operational [13:52:45] (03PS1) 10Muehlenhoff: eventbus: Restrict to domain networks [puppet] - 10https://gerrit.wikimedia.org/r/328665 [13:56:30] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 635 600 - REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 3920948 keys, up 52 days 5 hours - replication_delay is 635 [13:57:16] (03PS1) 10Alexandros Kosiaris: kubernetes::master: Fix typo in hiera for etcd urls [puppet] - 10https://gerrit.wikimedia.org/r/328666 [13:57:46] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] kubernetes::master: Fix typo in hiera for etcd urls [puppet] - 10https://gerrit.wikimedia.org/r/328666 (owner: 10Alexandros Kosiaris) [13:58:56] (03PS1) 10Eevans: [WIP]: Enable Cassandra on restbase-test100[1-3] [puppet] - 10https://gerrit.wikimedia.org/r/328667 (https://phabricator.wikimedia.org/T153880) [13:59:22] (03PS2) 10Eevans: [WIP]: Enable Cassandra on restbase-test100[1-3] [puppet] - 10https://gerrit.wikimedia.org/r/328667 (https://phabricator.wikimedia.org/T153880) [14:04:30] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 3914852 keys, up 52 days 5 hours - replication_delay is 10 [14:11:00] PROBLEM - Hadoop NodeManager on analytics1032 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args 
org.apache.hadoop.yarn.server.nodemanager.NodeManager [14:11:11] checking.. [14:13:55] !log manually starting the yarn nodemanager after OOM [14:13:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:14:00] RECOVERY - Hadoop NodeManager on analytics1032 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [14:14:13] !log the previous entry is missing: "on analytics1032" [14:14:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:17:47] (03PS4) 10Mobrovac: RESTBase-Cassandra: Add the topk reporter [puppet] - 10https://gerrit.wikimedia.org/r/328660 (https://phabricator.wikimedia.org/T147366) [14:20:11] (03CR) 10jerkins-bot: [V: 04-1] RESTBase-Cassandra: Add the topk reporter [puppet] - 10https://gerrit.wikimedia.org/r/328660 (https://phabricator.wikimedia.org/T147366) (owner: 10Mobrovac) [14:21:32] (03CR) 10Eevans: "@mobrovac Nice; Thanks for banging this out!" [puppet] - 10https://gerrit.wikimedia.org/r/328660 (https://phabricator.wikimedia.org/T147366) (owner: 10Mobrovac) [14:22:11] (03PS5) 10Mobrovac: RESTBase-Cassandra: Add the topk reporter [puppet] - 10https://gerrit.wikimedia.org/r/328660 (https://phabricator.wikimedia.org/T147366) [14:23:36] (03CR) 10Mobrovac: "I was thinking the same re repo, but I figure we can start with this and then switch to full-blown deploys if really needed." [puppet] - 10https://gerrit.wikimedia.org/r/328660 (https://phabricator.wikimedia.org/T147366) (owner: 10Mobrovac) [14:28:34] (03PS6) 10Mobrovac: RESTBase-Cassandra: Add the topk reporter [puppet] - 10https://gerrit.wikimedia.org/r/328660 (https://phabricator.wikimedia.org/T147366) [14:31:30] PROBLEM - puppet last run on sca2003 is CRITICAL: CRITICAL: Puppet has 27 failures. Last run 2 minutes ago with 27 failures. 
Failed resources (up to 3 shown): Exec[eth0_v6_token],Package[wipe],Package[zotero/translators],Package[zotero/translation-server] [14:40:21] (03PS7) 10Mobrovac: RESTBase-Cassandra: Add the topk reporter [puppet] - 10https://gerrit.wikimedia.org/r/328660 (https://phabricator.wikimedia.org/T147366) [14:44:16] (03PS8) 10Mobrovac: RESTBase-Cassandra: Add the topk reporter [puppet] - 10https://gerrit.wikimedia.org/r/328660 (https://phabricator.wikimedia.org/T147366) [14:49:27] (03PS9) 10Mobrovac: RESTBase-Cassandra: Add the topk reporter [puppet] - 10https://gerrit.wikimedia.org/r/328660 (https://phabricator.wikimedia.org/T147366) [14:51:16] !log restarting the yarn node manager java daemons on all the Hadoop worker nodes due to suspect memory leak [14:51:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:51:40] RECOVERY - Check systemd state on chlorine is OK: OK - running: The system is fully operational [14:53:20] PROBLEM - puppet last run on sca1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:55:26] (03PS2) 10Jcrespo: mariadb: Backup new m5 databases striker and labsdbaccounts [puppet] - 10https://gerrit.wikimedia.org/r/328476 [14:55:28] (03PS1) 10Jcrespo: dbstore: configuration changes to make InnoDB the main storage [puppet] - 10https://gerrit.wikimedia.org/r/328671 (https://phabricator.wikimedia.org/T130128) [14:56:20] (03CR) 10Jcrespo: [C: 04-1] "Cannot deploy yet, dbstore1001/2 are still using TokuDB." 
[puppet] - 10https://gerrit.wikimedia.org/r/328671 (https://phabricator.wikimedia.org/T130128) (owner: 10Jcrespo) [14:57:50] (03PS1) 10Gilles: Configure SMTP for Grafana [puppet] - 10https://gerrit.wikimedia.org/r/328673 (https://phabricator.wikimedia.org/T153167) [14:58:57] (03CR) 10Eevans: "> I was thinking the same re repo, but I figure we can start with" [puppet] - 10https://gerrit.wikimedia.org/r/328660 (https://phabricator.wikimedia.org/T147366) (owner: 10Mobrovac) [14:59:30] RECOVERY - puppet last run on sca2003 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [15:00:10] PROBLEM - check_puppetrun on rigel is CRITICAL: CRITICAL: Puppet has 14 failures [15:00:53] ^^ rigel alert is because of a puppetmaster restart [15:00:59] (03CR) 10Mobrovac: "> WFM; Should we delete that repo then?" [puppet] - 10https://gerrit.wikimedia.org/r/328660 (https://phabricator.wikimedia.org/T147366) (owner: 10Mobrovac) [15:01:36] (03CR) 10Mobrovac: "PCC looking good - https://puppet-compiler.wmflabs.org/4984/" [puppet] - 10https://gerrit.wikimedia.org/r/328660 (https://phabricator.wikimedia.org/T147366) (owner: 10Mobrovac) [15:02:15] jynus: ping [15:02:33] mafk, hi [15:02:40] jynus: can I PM? [15:02:46] yes [15:02:50] tnx [15:05:10] RECOVERY - check_puppetrun on rigel is OK: OK: Puppet is currently enabled, last run 221 seconds ago with 0 failures [15:05:10] PROBLEM - check_puppetrun on heka is CRITICAL: CRITICAL: Puppet has 1 failures [15:05:10] PROBLEM - check_puppetrun on alnilam is CRITICAL: CRITICAL: Puppet has 1 failures [15:05:23] grrr. 
[15:06:00] PROBLEM - check_puppetrun on alnilam is CRITICAL: CRITICAL: Puppet has 1 failures [15:06:40] RECOVERY - check_puppetrun on alnilam is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures [15:09:52] (03CR) 10Mobrovac: "LGTM (modulo the missing IPs, ofc)" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/328667 (https://phabricator.wikimedia.org/T153880) (owner: 10Eevans) [15:10:10] RECOVERY - check_puppetrun on heka is OK: OK: Puppet is currently enabled, last run 205 seconds ago with 0 failures [15:18:08] (03PS2) 10Elukey: hive/metastore: Restrict to analytics networks [puppet] - 10https://gerrit.wikimedia.org/r/328664 (owner: 10Muehlenhoff) [15:18:33] (03CR) 10Eevans: [WIP]: Enable Cassandra on restbase-test100[1-3] (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/328667 (https://phabricator.wikimedia.org/T153880) (owner: 10Eevans) [15:20:10] PROBLEM - check_puppetrun on frauth1001 is CRITICAL: CRITICAL: Puppet has 25 failures [15:22:22] (03CR) 10Elukey: [C: 032] hive/metastore: Restrict to analytics networks [puppet] - 10https://gerrit.wikimedia.org/r/328664 (owner: 10Muehlenhoff) [15:23:20] RECOVERY - puppet last run on sca1003 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:25:10] PROBLEM - check_puppetrun on boron is CRITICAL: CRITICAL: Puppet has 1 failures [15:25:10] RECOVERY - check_puppetrun on frauth1001 is OK: OK: Puppet is currently enabled, last run 257 seconds ago with 0 failures [15:26:40] PROBLEM - check_puppetrun on boron is CRITICAL: CRITICAL: Puppet has 1 failures [15:30:10] RECOVERY - check_puppetrun on boron is OK: OK: Puppet is currently enabled, last run 200 seconds ago with 0 failures [15:35:10] PROBLEM - check_puppetrun on heka is CRITICAL: CRITICAL: Puppet has 1 failures [15:36:30] RECOVERY - check_puppetrun on heka is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures [15:39:35] !log restart dbstore2001 to change buffer pool 
size, testing gerrit:328671 [15:39:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:45:22] 06Operations, 10Cassandra, 10RESTBase-Cassandra, 06Services (watching): Cassandra uses default ip address for outbound packets while bootstrapping - https://phabricator.wikimedia.org/T128590#2896265 (10mobrovac) [15:56:12] 06Operations, 06Performance-Team, 10Thumbor: Gifsicle engine: AttributeError: 'Engine' object has no attribute 'exif' - https://phabricator.wikimedia.org/T145504#2896276 (10Gilles) 05Open>03Resolved PR merged [15:58:19] (03PS1) 10Muehlenhoff: Rename ferm service in role::labs::db::replica [puppet] - 10https://gerrit.wikimedia.org/r/328683 [16:00:30] 06Operations, 06Performance-Team, 10Thumbor: Implement PoolCounter support in Thumbor - https://phabricator.wikimedia.org/T151066#2896285 (10Gilles) p:05Normal>03Triage [16:00:52] 06Operations, 06Analytics-Kanban, 15User-Elukey: Yarn node manager JVM memory leaks - https://phabricator.wikimedia.org/T153951#2896288 (10elukey) [16:01:31] 06Operations, 06Analytics-Kanban, 15User-Elukey: Yarn node manager JVM memory leaks - https://phabricator.wikimedia.org/T153951#2896303 (10elukey) [16:03:54] (03CR) 10Jcrespo: [C: 04-1] "I am ok with changing the name, but name it labs_db_replica, replica only is confusing." [puppet] - 10https://gerrit.wikimedia.org/r/328683 (owner: 10Muehlenhoff) [16:08:14] (03PS2) 10Muehlenhoff: Rename ferm service in role::labs::db::replica [puppet] - 10https://gerrit.wikimedia.org/r/328683 [16:09:30] PROBLEM - puppet last run on sca2004 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [16:10:35] (03CR) 10Jcrespo: [C: 031] Rename ferm service in role::labs::db::replica [puppet] - 10https://gerrit.wikimedia.org/r/328683 (owner: 10Muehlenhoff) [16:11:33] 06Operations, 06Performance-Team, 10Thumbor: Implement PoolCounter support in Thumbor - https://phabricator.wikimedia.org/T151066#2896336 (10Gilles) p:05Triage>03Normal [16:15:07] 06Operations, 06Analytics-Kanban, 15User-Elukey: Yarn node manager JVM memory leaks - https://phabricator.wikimedia.org/T153951#2896357 (10elukey) p:05Triage>03Normal [16:15:29] 06Operations, 06Analytics-Kanban, 15User-Elukey: Yarn node manager JVM memory leaks - https://phabricator.wikimedia.org/T153951#2896288 (10elukey) `15:51 !log restarting the yarn node manager java daemons on all the Hadoop worker nodes due to suspect memory leak` [16:18:01] 06Operations, 13Patch-For-Review: Split carbon's install/mirror roles, provision install1001 - https://phabricator.wikimedia.org/T132757#2896359 (10Dzahn) oops, yea, of course, private(!)-eqiad, consider that a typo [16:20:12] (03CR) 10Dzahn: "yea. 
so i had this too https://gerrit.wikimedia.org/r/#/c/328450/ but i should merge it into a single thing," [puppet] - 10https://gerrit.wikimedia.org/r/328599 (https://phabricator.wikimedia.org/T123733) (owner: 10Dzahn) [16:23:06] 06Operations, 10DBA, 10Gerrit, 06Release-Engineering-Team, 05Security: Gerrit: Convert gerrit's db character encoding from utf8 to utf8mb4 to prevent truncation of astral characters - https://phabricator.wikimedia.org/T153899#2894520 (10Bawolff) [16:24:50] 06Operations, 10DBA, 10Gerrit, 07Upstream: Gerrit shows HTTP 500 error when pasting an emoji - https://phabricator.wikimedia.org/T145885#2896367 (10Paladox) [16:25:18] 06Operations, 10DBA, 10Gerrit, 07Upstream: Gerrit shows HTTP 500 error when pasting an emoji - https://phabricator.wikimedia.org/T145885#2644059 (10Paladox) Adding #dba and #operations as this requires changes to the db so it needs their involvement [16:26:09] (03PS3) 10Paladox: Gerrit: Convert from utf8 to utf8mb4 [puppet] - 10https://gerrit.wikimedia.org/r/328571 (https://phabricator.wikimedia.org/T153899) [16:26:16] (03PS4) 10Paladox: Gerrit: Convert from utf8 to utf8mb4 [puppet] - 10https://gerrit.wikimedia.org/r/328571 (https://phabricator.wikimedia.org/T153899) [16:30:20] (03CR) 10Dzahn: "thanks! btw, i tried to use https but it's broken on that site" [puppet] - 10https://gerrit.wikimedia.org/r/328645 (owner: 10Nemo bis) [16:30:54] (03PS2) 10Dzahn: [Planet Wikimedia] Update .mau. feed URL on Italian planet [puppet] - 10https://gerrit.wikimedia.org/r/328645 (owner: 10Nemo bis) [16:32:26] 06Operations, 10DBA, 10Gerrit, 13Patch-For-Review, 07Upstream: Gerrit shows HTTP 500 error when pasting an emoji - https://phabricator.wikimedia.org/T145885#2896386 (10jcrespo) Can someone check that gerrit works with utf8mb4 before doing a destructive operation? [16:33:05] Reedy: you once used this URL https://github.com/reedy/https-everywhere/commit/f392ad52848d637f39474b2abd13a13735372a72 know why it's 404 nowadays? 
[16:33:35] No idea... [16:33:38] Replaced commit? [16:33:43] ok [16:33:44] It's not upstream either.. [16:33:45] https://github.com/EFForg/https-everywhere/commit/f392ad52848d637f39474b2abd13a13735372a72 [16:33:48] What are you looking for? :P [16:34:20] PROBLEM - puppet last run on sca1004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:34:32] Reedy: i was reading https://phabricator.wikimedia.org/T46751#531340 and followed the link, wanted to know if there is still a planet rule in httpseverywhere [16:34:45] and the mixed-content thing [16:34:57] https://github.com/EFForg/https-everywhere/blob/master/src/chrome/content/rules/Wikimedia.xml [16:35:00] what i _actually_ wanted is another ticket that talks about mixed content on planet [16:35:04] thanks [16:37:30] RECOVERY - puppet last run on sca2004 is OK: OK: Puppet is currently enabled, last run 0 seconds ago with 0 failures [16:40:07] Hmm, was that my report on embedded media? [16:40:48] https://en.wikipedia.org/wiki/Wikipedia:Wikipedia_Signpost/2016-12-22/Technology_report [16:40:56] Nemo_bis: it was in relation to https-only [16:41:34] tried to switch as many external planet feeds to https as well [16:41:43] because it embeds images [16:43:50] (03CR) 10Dzahn: [C: 032] [Planet Wikimedia] Update .mau. feed URL on Italian planet [puppet] - 10https://gerrit.wikimedia.org/r/328645 (owner: 10Nemo bis) [16:44:17] Nemo_bis: do you think xmau.com would like to fix https using LE? 
[16:44:28] might be nice to just give the admin a hint [16:45:25] currently "SSL_ERROR_RX_RECORD_TOO_LONG" / "ERR_SSL_PROTOCOL_ERROR" [16:47:13] (03Abandoned) 10MarcoAurelio: Allow private.dblist wikis to manage more permissions internally [mediawiki-config] - 10https://gerrit.wikimedia.org/r/325531 (https://phabricator.wikimedia.org/T152489) (owner: 10MarcoAurelio) [16:47:43] apergos: re: dashboards, the individual hosts are available from the dropdown in https://grafana.wikimedia.org/dashboard/db/prometheus-cluster-breakdown for each metric but not in a single graph no [16:47:54] yeah [16:48:14] I do like to be able to look at em all, especially because there's only a handful [16:48:44] workflow pains [16:48:51] (03Abandoned) 10MarcoAurelio: Remove upload rights on wikis where local uploads are disabled [mediawiki-config] - 10https://gerrit.wikimedia.org/r/306443 (https://phabricator.wikimedia.org/T143789) (owner: 10MarcoAurelio) [16:49:20] PROBLEM - puppet last run on mw1279 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:54:17] apergos: hehe it is possible to build a dashboard with all hosts in a single graph heh, though for most clusters with e.g. > 5 hosts it gets too noisy IMO [16:54:44] yeah I'm a special case, it's 3 plus canary, plus two servers [16:54:59] so just about ideal for having them all on one page [16:55:01] meh whatever [16:55:11] mutante: .mau. 
is one of the founders of Internet in Italy, I'm pretty sure he knows about HTTPS [16:55:28] I don't right now, they are all buried in 'misc' but I was thinking about putting them into their own group for that very reason, see [16:55:34] He has quite a lot of stuff in his domain though and we already use his time in many ways [16:56:37] (03PS1) 10Ema: WIP: varnishreqstats: port to cachestats.CacheStatsSender [puppet] - 10https://gerrit.wikimedia.org/r/328688 (https://phabricator.wikimedia.org/T151643) [16:57:45] Nemo_bis: ok, thanks and yes it sounded very familiar [16:58:49] <_joe_> Nemo_bis: knowing about HTTPS doesn't mean being aware it's broken (as in, the server responds with HTTP over the HTTPS port) [17:00:37] i can send him a quick mail [17:02:20] RECOVERY - puppet last run on sca1004 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures [17:07:04] _joe_: Nemo_bis: looking at the other feeds, another one where https would be nice is https://www.wikimedia.it [17:07:26] guess i'll tell them [17:07:39] this kind of just times out for me [17:07:57] we already know [17:08:02] ah, ok [17:09:40] PROBLEM - puppet last run on cp3004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:10:27] * _joe_ away [17:17:20] RECOVERY - puppet last run on mw1279 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures [17:30:12] 06Operations, 10DBA, 10Gerrit, 13Patch-For-Review, 07Upstream: Gerrit shows HTTP 500 error when pasting an emoji - https://phabricator.wikimedia.org/T145885#2896493 (10Paladox) @jcrespo hi, I can test it on gerrit-test. What commands do I use to convert them? I did some searching on doing that and runni... 
[17:37:40] RECOVERY - puppet last run on cp3004 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures [17:39:52] 06Operations, 10DBA, 10Gerrit, 13Patch-For-Review, 07Upstream: Gerrit shows HTTP 500 error when pasting an emoji - https://phabricator.wikimedia.org/T145885#2896517 (10jcrespo) > I did some searching on doing that and running that causes some error about keys being to big. While I haven't predicted that... [17:41:08] (03PS2) 10Tim Landscheidt: Tools: Generate node sets dynamically [puppet] - 10https://gerrit.wikimedia.org/r/328030 [17:41:54] (03CR) 10jerkins-bot: [V: 04-1] Tools: Generate node sets dynamically [puppet] - 10https://gerrit.wikimedia.org/r/328030 (owner: 10Tim Landscheidt) [17:44:00] (03PS3) 10Tim Landscheidt: Tools: Generate node sets dynamically [puppet] - 10https://gerrit.wikimedia.org/r/328030 [17:50:12] 06Operations, 10DBA, 10Gerrit, 13Patch-For-Review, 07Upstream: Gerrit shows HTTP 500 error when pasting an emoji - https://phabricator.wikimedia.org/T145885#2896529 (10Paladox) root@gerrit-test:/home/paladox# mysql -p -BN reviewdb -e "SHOW TABLES" | while read table; do mysql -p reviewdb -e "ALTER TABLE... [17:53:44] 06Operations, 10DBA, 10Gerrit, 13Patch-For-Review, 07Upstream: Gerrit shows HTTP 500 error when pasting an emoji - https://phabricator.wikimedia.org/T145885#2896531 (10jcrespo) > COuld that be because on gerrit-test the tables were created with latin1? No necessarily, previous indexes that took less tha... [18:02:47] 06Operations, 10DBA, 10Gerrit, 13Patch-For-Review, 07Upstream: Gerrit shows HTTP 500 error when pasting an emoji - https://phabricator.wikimedia.org/T145885#2896566 (10Paladox) Oh, I have enabled innodb_large_prefix and same error, I will try it on gerrit-test3 database which is replica of prod gerrit.... 
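Editorial aside on the "keys being too big" error mentioned in the thread above: a rough sketch of the arithmetic, not something posted in the channel. It assumes the usual InnoDB index key prefix limits at the default 16 KB page size (767 bytes for COMPACT/REDUNDANT row formats, 3072 bytes with innodb_large_prefix and a DYNAMIC or COMPRESSED row format) and a hypothetical indexed VARCHAR(255) column; the actual Gerrit schema is not shown in this log.

```python
# Maximum bytes per character for each MySQL character set.
BYTES_PER_CHAR = {"latin1": 1, "utf8": 3, "utf8mb4": 4}

# InnoDB index key prefix limits in bytes (assumed defaults, 16 KB pages).
LIMIT_COMPACT = 767          # COMPACT / REDUNDANT row formats
LIMIT_LARGE_PREFIX = 3072    # innodb_large_prefix + DYNAMIC / COMPRESSED


def index_key_bytes(varchar_len, charset):
    """Worst-case size in bytes of an index on a VARCHAR(varchar_len) column."""
    return varchar_len * BYTES_PER_CHAR[charset]


def fits(varchar_len, charset, limit):
    """True if a full-column index on VARCHAR(varchar_len) fits within limit."""
    return index_key_bytes(varchar_len, charset) <= limit


if __name__ == "__main__":
    # Hypothetical VARCHAR(255) indexed column, checked against both limits.
    for charset in ("latin1", "utf8", "utf8mb4"):
        size = index_key_bytes(255, charset)
        print(charset, size,
              "compact ok" if fits(255, charset, LIMIT_COMPACT) else "compact too big",
              "large-prefix ok" if fits(255, charset, LIMIT_LARGE_PREFIX) else "large-prefix too big")
```

Under these assumptions a 255-character key is 765 bytes in utf8 and fits the 767-byte limit, while the same column in utf8mb4 needs 1020 bytes and only fits once the larger prefix limit applies.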
[18:09:08] Operations, DBA, Gerrit, Patch-For-Review, Upstream: Gerrit shows HTTP 500 error when pasting an emoji - https://phabricator.wikimedia.org/T145885#2896572 (Paladox) Happens on a utf8 database too.
[18:19:31] Operations, Stewards-and-global-tools, User-Urbanecm: Special:GlobalUserRights shows an error at meta.wikimedia.org - https://phabricator.wikimedia.org/T153961#2896631 (Urbanecm)
[18:19:43] Operations, Stewards-and-global-tools, User-Urbanecm: Special:GlobalUserRights shows an error at meta.wikimedia.org - https://phabricator.wikimedia.org/T153961#2896644 (Urbanecm) p:Triage>Unbreak! Breaks production.
[18:22:22] Operations, Stewards-and-global-tools, User-Urbanecm: Special:GlobalUserRights shows an error at meta.wikimedia.org - https://phabricator.wikimedia.org/T153961#2896631 (Bawolff) MapCacheLRU::has called with invalid key. Must be string or integer. ``` { "file": "/srv/mediawiki/php-1.29.0-wmf.6/in...
[18:23:02] Operations, Stewards-and-global-tools, User-Urbanecm: Special:GlobalUserRights shows an error at meta.wikimedia.org - https://phabricator.wikimedia.org/T153961#2896663 (greg)
[18:23:17] Operations, Stewards-and-global-tools, User-Urbanecm: Special:GlobalUserRights shows an error at meta.wikimedia.org - https://phabricator.wikimedia.org/T153961#2896664 (Krenair)
[18:25:12] Operations, DBA, Gerrit, Patch-For-Review, Upstream: Gerrit shows HTTP 500 error when pasting an emoji - https://phabricator.wikimedia.org/T145885#2896683 (Paladox) {P4672}
[18:28:29] Operations, DBA, Gerrit, Patch-For-Review, Upstream: Gerrit shows HTTP 500 error when pasting an emoji - https://phabricator.wikimedia.org/T145885#2896697 (Paladox) ah, you need to also have rowformat at ROW_FORMAT=DYNAMIC
[18:32:10] PROBLEM - puppet last run on labservices1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[18:33:06] (CR) Chad: [C: -1] "Haven't reviewed, won't review until after the holidays as I'm not around to handle a migration." [puppet] - https://gerrit.wikimedia.org/r/328571 (https://phabricator.wikimedia.org/T153899) (owner: Paladox)
[18:34:26] Operations, DBA, Gerrit, Patch-For-Review, Upstream: Gerrit shows HTTP 500 error when pasting an emoji - https://phabricator.wikimedia.org/T145885#2896711 (Paladox) I converted the patch_comments table to utf8mb4, it stops the error but instead of showing the emoji it shows ??? which is bette...
[18:35:20] Operations, Analytics, Reading-Web-Backlog, Traffic: mobile-safari has very few internally-referred pageviews - https://phabricator.wikimedia.org/T148780#2896712 (JKatzWMF) @Nuria @mforns I think having an alternative with the typo'd version makes a lot of sense. These metrics are used as a pro...
[18:35:35] Operations, DBA, Gerrit, Patch-For-Review, Upstream: Gerrit shows HTTP 500 error when pasting an emoji - https://phabricator.wikimedia.org/T145885#2896714 (jcrespo) >>! In T145885#2896697, @Paladox wrote: > ah, you need to also have rowformat at ROW_FORMAT=DYNAMIC Yes, it requires Barracuda...
[18:36:54] Operations, Analytics, Reading-Web-Backlog, Traffic: mobile-safari has very few internally-referred pageviews - https://phabricator.wikimedia.org/T148780#2896722 (mforns) a:BBlack
[18:37:22] Operations, Analytics, Reading-Web-Backlog, Traffic: mobile-safari has very few internally-referred pageviews - https://phabricator.wikimedia.org/T148780#2732531 (mforns) Assigned the task to @BBlack , so that he can give his opinion on this.
[18:39:17] Operations, DBA, Gerrit, Patch-For-Review, Upstream: Gerrit shows HTTP 500 error when pasting an emoji - https://phabricator.wikimedia.org/T145885#2896742 (jcrespo) > instead of showing the emoji it shows ??? which is better then an error At this point I would suggest setting sql_mode=TRADIT...
[18:45:56] Operations, Puppet, Patch-For-Review: apache::static_site is not working - https://phabricator.wikimedia.org/T153816#2896763 (Krenair)
[18:47:03] Operations, DBA, Gerrit, Patch-For-Review, Upstream: Gerrit shows HTTP 500 error when pasting an emoji - https://phabricator.wikimedia.org/T145885#2896775 (Paladox) Oh so we doint want to support emoji's?
[18:47:23] (CR) Dzahn: [C: -2] "using -2 to set it to "stalled" until after holidays. per note in commit message about needing db conversion, comment from Chad, code free" [puppet] - https://gerrit.wikimedia.org/r/328571 (https://phabricator.wikimedia.org/T153899) (owner: Paladox)
[18:49:30] Operations, DBA, Gerrit, Patch-For-Review, Upstream: Gerrit shows HTTP 500 error when pasting an emoji - https://phabricator.wikimedia.org/T145885#2896811 (demon) Emojis are the best & most important feature in any modern web application 🙃
[18:50:05] Operations, DBA, Gerrit, Patch-For-Review, Upstream: Gerrit shows HTTP 500 error when pasting an emoji - https://phabricator.wikimedia.org/T145885#2896813 (jcrespo) >>! In T145885#2896811, @demon wrote: > Emojis are the best & most important feature in any modern web application 🙃 Tell that...
[18:51:00] (CR) Dzahn: [C: -2] "same here, consider my vote just setting this to "stalled", we won't do this now, but keep it until early next year." [puppet] - https://gerrit.wikimedia.org/r/327763 (https://phabricator.wikimedia.org/T148478) (owner: Paladox)
[18:51:12] Operations, DBA, Gerrit, Patch-For-Review, Upstream: Gerrit shows HTTP 500 error when pasting an emoji - https://phabricator.wikimedia.org/T145885#2896817 (Paladox) >>! In T145885#2896813, @jcrespo wrote: >>>! In T145885#2896811, @demon wrote: >> Emojis are the best & most important feature i...
[18:52:10] (PS1) Filippo Giunchedi: swift: drain thumbor traffic [puppet] - https://gerrit.wikimedia.org/r/328713 (https://phabricator.wikimedia.org/T151851)
[18:56:14] (CR) Dzahn: [C: 1] "it's not actually about re-starting, it's just about the initial start after that directoy gets created when an instance is created (from " [puppet] - https://gerrit.wikimedia.org/r/328051 (https://phabricator.wikimedia.org/T141450) (owner: Paladox)
[18:56:38] (PS5) Dzahn: Contint: Notify Service mysql to restart [puppet] - https://gerrit.wikimedia.org/r/328051 (https://phabricator.wikimedia.org/T141450) (owner: Paladox)
[18:56:59] (CR) Dzahn: [C: 1] "not merging now either, but i wanted to update the ticket" [puppet] - https://gerrit.wikimedia.org/r/328051 (https://phabricator.wikimedia.org/T141450) (owner: Paladox)
[18:57:23] Operations, DBA, Gerrit, Patch-For-Review, Upstream: Gerrit shows HTTP 500 error when pasting an emoji - https://phabricator.wikimedia.org/T145885#2896831 (Bawolff) >>! In T145885#2896711, @Paladox wrote: > I converted the patch_comments table to utf8mb4, it stops the error but instead of sho...
[18:58:07] (PS6) Dzahn: Contint: notify service mysql on creation of mysql dir [puppet] - https://gerrit.wikimedia.org/r/328051 (https://phabricator.wikimedia.org/T141450) (owner: Paladox)
[18:59:27] Operations, DBA, Gerrit, Patch-For-Review, Upstream: Gerrit shows HTTP 500 error when pasting an emoji - https://phabricator.wikimedia.org/T145885#2896833 (Paladox) Oh. this {P4674} shows the new update to patch_comments | patch_sets | InnoDB | 10 | Compact | 350...
[18:59:41] (CR) Filippo Giunchedi: [C: 2] swift: drain thumbor traffic [puppet] - https://gerrit.wikimedia.org/r/328713 (https://phabricator.wikimedia.org/T151851) (owner: Filippo Giunchedi)
[19:01:10] RECOVERY - puppet last run on labservices1002 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures
[19:01:39] are there DNS or other connectivity issues going on? I'm finding that my linode server is getting timeouts when connecting to WMF servers, even though my home connections to both WMF servers and linode servers seem fine.
[19:02:21] can you share a traceroute?
[19:02:39] will work on that shortly.
[19:04:50] (PS2) Dzahn: Labs: Remove obsolete code (os_version('ubuntu <= precise')) [puppet] - https://gerrit.wikimedia.org/r/326312 (owner: Tim Landscheidt)
[19:04:57] !log roll restart swift proxy on ms-fe1* to drain thumbor traffic
[19:04:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:05:03] Operations, DBA, Gerrit, Patch-For-Review, Upstream: Gerrit shows HTTP 500 error when pasting an emoji - https://phabricator.wikimedia.org/T145885#2896841 (Paladox) Hmm could this be the problem MariaDB [reviewdb]> SHOW VARIABLES WHERE Variable_name LIKE 'character\_set\_%' OR Variable_name...
[19:05:49] ragesoss: not that I'm aware of
[19:06:29] (CR) Dzahn: [C: 1] "Yuvi,Andrew, one more for end-of-year cleanup maybe?" [puppet] - https://gerrit.wikimedia.org/r/326312 (owner: Tim Landscheidt)
[19:09:23] (CR) Dzahn: "looks ok, afaict. but were you able to test between your own labs logstash instance and your labs gerrit instance when they are not on loc" [puppet] - https://gerrit.wikimedia.org/r/326177 (https://phabricator.wikimedia.org/T141324) (owner: Paladox)
[19:10:15] (CR) Paladox: "@Dzahn yeh, I tested using wmflabs eqiad name." [puppet] - https://gerrit.wikimedia.org/r/326177 (https://phabricator.wikimedia.org/T141324) (owner: Paladox)
[19:10:23] Operations, ops-eqiad: update label/racktables visible label for thumbnor100[12] - https://phabricator.wikimedia.org/T153965#2896846 (RobH)
[19:10:34] (CR) Dzahn: "also a bit more commmit message wouldn't hurt. why are we doing it for example" [puppet] - https://gerrit.wikimedia.org/r/326177 (https://phabricator.wikimedia.org/T141324) (owner: Paladox)
[19:10:56] ragesoss, seems fine to me
[19:11:32] thanks Krenair and chasemp. I will dig in further.
[19:12:03] Operations, ops-eqiad: update label/racktables visible label for thumbnor100[12] - https://phabricator.wikimedia.org/T153965#2896859 (RobH) I also updated the switch port descriptions, which weren't correct. Mgmt dns had been corrected before I noticed this discrepancy.
[19:12:34] (PS16) Paladox: Gerrit: Enable logstash in gerrit [puppet] - https://gerrit.wikimedia.org/r/326177 (https://phabricator.wikimedia.org/T141324)
[19:12:41] (PS17) Paladox: Gerrit: Enable logstash in gerrit [puppet] - https://gerrit.wikimedia.org/r/326177 (https://phabricator.wikimedia.org/T141324)
[19:15:10] PROBLEM - Swift HTTP frontend on ms-fe3001 is CRITICAL: connect to address 10.20.0.15 and port 80: Connection refused
[19:15:10] PROBLEM - Swift HTTP frontend on ms-fe3002 is CRITICAL: connect to address 10.20.0.16 and port 80: Connection refused
[19:15:11] PROBLEM - Swift HTTP backend on ms-fe3001 is CRITICAL: connect to address 10.20.0.15 and port 80: Connection refused
[19:15:16] (CR) Paladox: "@Chad (ostriches) could you +1 or -1 this please?" [puppet] - https://gerrit.wikimedia.org/r/326177 (https://phabricator.wikimedia.org/T141324) (owner: Paladox)
[19:15:20] PROBLEM - Swift HTTP backend on ms-fe3002 is CRITICAL: connect to address 10.20.0.16 and port 80: Connection refused
[19:15:24] that's me, esams
[19:15:30] godog: :)
[19:16:10] RECOVERY - Swift HTTP frontend on ms-fe3001 is OK: HTTP OK: HTTP/1.1 200 OK - 185 bytes in 0.175 second response time
[19:16:10] RECOVERY - Swift HTTP backend on ms-fe3001 is OK: HTTP OK: HTTP/1.1 200 OK - 393 bytes in 0.184 second response time
[19:17:10] RECOVERY - Swift HTTP frontend on ms-fe3002 is OK: HTTP OK: HTTP/1.1 200 OK - 185 bytes in 0.174 second response time
[19:17:20] RECOVERY - Swift HTTP backend on ms-fe3002 is OK: HTTP OK: HTTP/1.1 200 OK - 393 bytes in 0.191 second response time
[19:18:40] !log stopping replication on dbstore2001(s2) and db2035 for enwiktionary.templatelinks reimport
[19:18:41] !log restarting elasticsearch on relforge100[12] to test ltr plugin
[19:18:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:18:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:21:20] PROBLEM - ElasticSearch health check for shards on relforge1001 is CRITICAL: CRITICAL - elasticsearch inactive shards 353 threshold =0.1% breach: status: red, number_of_nodes: 2, unassigned_shards: 347, number_of_pending_tasks: 14, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 8, task_max_waiting_in_queue_millis: 4153, cluster_name: relforge-eqiad, relocating_shards: 0, active_shards_percent_as_number:
[19:22:50] !log restart wdqs-blazegraph and wdqs-updater on wdqs1001.eqiad.wmnet (suspicious load)
[19:22:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:22:54] SMalyshev: ^
[19:23:20] RECOVERY - ElasticSearch health check for shards on relforge1001 is OK: OK - elasticsearch status relforge-eqiad: status: green, number_of_nodes: 2, unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 275, task_max_waiting_in_queue_millis: 0, cluster_name: relforge-eqiad, relocating_shards: 0, active_shards_percent_as_number: 100.0, active_shards: 361, initial
[19:26:50] Operations, ops-eqiad: update label/racktables visible label for labservices1002/WMF4075 - https://phabricator.wikimedia.org/T153967#2896888 (RobH)
[19:27:40] Operations, DBA, Gerrit, Patch-For-Review, Upstream: Gerrit shows HTTP 500 error when pasting an emoji - https://phabricator.wikimedia.org/T145885#2896901 (Paladox) ahaha it works now. you have to set character-set-client-handshake = FALSE character-set-server = utf8mb4 collation-server = u...
[19:30:27] Operations, Analytics, Reading-Web-Backlog, Traffic: mobile-safari has very few internally-referred pageviews - https://phabricator.wikimedia.org/T148780#2896906 (Nuria) @JKatzWMF Do ping @BBlack about the impact of the change in your metrics, on our end there are no code changes needed to proces...
[19:31:26] (CR) Dzahn: "is the misc-web refactoring ongoing?" [puppet] - https://gerrit.wikimedia.org/r/324797 (https://phabricator.wikimedia.org/T137928) (owner: Dzahn)
[19:46:24] Operations, DBA, Gerrit, Patch-For-Review, Upstream: Gerrit shows HTTP 500 error when pasting an emoji - https://phabricator.wikimedia.org/T145885#2896928 (Paladox) >>! In T145885#2896517, @jcrespo wrote: >> I did some searching on doing that and running that causes some error about keys bein...
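Editor's note on the T145885 thread above: the move from utf8 to utf8mb4 matters because MySQL/MariaDB's legacy `utf8` character set stores at most 3 bytes per character, while emoji such as the one quoted in the thread need 4 bytes in UTF-8. The following is only a minimal sketch of that byte-width check in plain Python; the helper names `utf8_width` and `fits` are made up for illustration and nothing here talks to Gerrit or a database.

```python
# Illustrative sketch only: shows why a 3-byte-per-character "utf8" column
# cannot hold an emoji, while utf8mb4 (4 bytes per character) can.
# The helper names below are hypothetical, not part of any library.

MAX_BYTES_UTF8MB3 = 3  # per-character limit of MySQL/MariaDB legacy "utf8"
MAX_BYTES_UTF8MB4 = 4  # per-character limit of "utf8mb4"


def utf8_width(ch: str) -> int:
    """Return how many bytes one character occupies when encoded as UTF-8."""
    return len(ch.encode("utf-8"))


def fits(text: str, max_bytes: int) -> bool:
    """True if every character of text fits within max_bytes of UTF-8."""
    return all(utf8_width(ch) <= max_bytes for ch in text)


EMOJI = "\U0001F643"  # upside-down face, the emoji quoted in the thread

if __name__ == "__main__":
    for sample in ["Gerrit", "\u20ac", EMOJI]:
        widths = [utf8_width(ch) for ch in sample]
        print(repr(sample), widths,
              "utf8:", fits(sample, MAX_BYTES_UTF8MB3),
              "utf8mb4:", fits(sample, MAX_BYTES_UTF8MB4))
```

The check is per character rather than per string because the character set limit applies to each code point; the server and client settings quoted in the thread are what make the connection actually use utf8mb4 end to end.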
[19:48:29] (CR) Dzahn: [C: -1] "for idn in xn*; do echo $idn; dig NS $idn | grep -A3 "ANSWER SECTION"; done" [dns] - https://gerrit.wikimedia.org/r/328604 (https://phabricator.wikimedia.org/T137105) (owner: Dzahn)
[19:48:37] (PS1) Urbanecm: Add DW alias for NS_PROJECT_TALK in frwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/328719 (https://phabricator.wikimedia.org/T153952)
[19:49:23] Operations, DBA, Gerrit, Patch-For-Review, Upstream: Gerrit shows HTTP 500 error when pasting an emoji - https://phabricator.wikimedia.org/T145885#2896945 (Paladox) Most of the table has been converted and I carn't see any problems, it's working as before but now emoji's work :)
[19:50:02] "Non-Commercial Partnership of promotion of distribution of encyclopedic knowledge Wikimedia RU"
[19:50:17] ^ that's just regular Wikimedia Russia?
[19:50:39] registrar: RUCENTER-RF org: "Non-Commercial Partnership of promotion of distribution of encyclopedic knowledge Wikimedia RU"
[19:50:57] in a domain whois
[19:52:40] My problem resolved itself, while I was running traceroute (or probably just before). But for ~45 mins, there was some sort of problem that results in timeouts whenever my app on linode tried to get an OAuth token or otherwise connect to Wikipedia.
[19:53:06] Thanks, Krenair and chasemp.
[19:53:14] I did nothing
[19:53:35] you answered my question about whether there were known problems, which was helpful!
[19:53:46] ok :)
[19:59:23] !log restarting elasticsearch (again) on relforge100[12] to test ltr plugin
[19:59:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:01:30] Operations, Analytics, Reading-Web-Backlog, Traffic: mobile-safari has very few internally-referred pageviews - https://phabricator.wikimedia.org/T148780#2896977 (JKatzWMF) Ok thanks!
[20:03:05] (PS3) Dzahn: ganglia: switch eqiad aggregator from carbon to install1001 [puppet] - https://gerrit.wikimedia.org/r/328599 (https://phabricator.wikimedia.org/T123733)
[20:03:42] (Abandoned) Dzahn: move ganglia aggregator eqiad from carbon to install1001 [puppet] - https://gerrit.wikimedia.org/r/328450 (https://phabricator.wikimedia.org/T132757) (owner: Dzahn)
[20:05:18] (PS4) Dzahn: ganglia: switch eqiad aggregator from carbon to install1001 [puppet] - https://gerrit.wikimedia.org/r/328599 (https://phabricator.wikimedia.org/T123733)
[20:06:01] (PS5) Dzahn: ganglia: switch eqiad aggregator from carbon to install1001 [puppet] - https://gerrit.wikimedia.org/r/328599 (https://phabricator.wikimedia.org/T123733)
[20:07:15] (CR) Dzahn: [C: -1] "yep, let's continue on this next year, needs some more thought." [puppet] - https://gerrit.wikimedia.org/r/327695 (https://phabricator.wikimedia.org/T150771) (owner: Dzahn)
[20:08:36] (CR) Chad: "Functionally ok, inline nit." (1 comment) [puppet] - https://gerrit.wikimedia.org/r/326177 (https://phabricator.wikimedia.org/T141324) (owner: Paladox)
[20:10:46] (CR) Dzahn: "@Andrew, is this the part where you said it should stay but only "ironic" (hardware labs) would use it, not instances/VMs?" [puppet] - https://gerrit.wikimedia.org/r/328597 (https://phabricator.wikimedia.org/T123733) (owner: Dzahn)
[20:11:55] (CR) Dzahn: "if we have a general consensus on the shell script extension thing.. i'll compile this to proof it's no-op on einsteinium/tegmen..." [puppet] - https://gerrit.wikimedia.org/r/327592 (https://phabricator.wikimedia.org/T148494) (owner: Dzahn)
[20:12:58] (CR) Dzahn: [C: 1] "it's all just about the .sh file extension being added.. and all they do is create login banners" [puppet] - https://gerrit.wikimedia.org/r/327673 (https://phabricator.wikimedia.org/T148494) (owner: Dzahn)
[20:15:25] (CR) Dzahn: [C: -1] "what i would like to know here is: should apt.wikimedia.org.conf.erb be in modules "aptrepo" rather than "install_server", yes or no? and" [puppet] - https://gerrit.wikimedia.org/r/325864 (https://phabricator.wikimedia.org/T132757) (owner: Dzahn)
[20:16:10] (CR) Dzahn: [C: 1] "ok thanks, waiting until after you are back" [puppet] - https://gerrit.wikimedia.org/r/318451 (owner: Dzahn)
[20:17:59] (PS18) Paladox: Gerrit: Enable logstash in gerrit [puppet] - https://gerrit.wikimedia.org/r/326177 (https://phabricator.wikimedia.org/T141324)
[20:18:27] (CR) Paladox: "@Chad done :)" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/326177 (https://phabricator.wikimedia.org/T141324) (owner: Paladox)
[20:21:10] Operations, DBA, Gerrit, Patch-For-Review, Upstream: Gerrit shows HTTP 500 error when pasting an emoji - https://phabricator.wikimedia.org/T145885#2897016 (jcrespo) > ahaha it works now. > > you have to set > > character-set-client-handshake = FALSE > character-set-server = utf8mb4 > collat...
[20:22:51] Operations, DBA, Gerrit, Patch-For-Review, Upstream: Gerrit shows HTTP 500 error when pasting an emoji - https://phabricator.wikimedia.org/T145885#2897017 (Paladox) Oh, that should work too.
[20:29:31] Operations: spare/unused disks on application servers - https://phabricator.wikimedia.org/T106381#1466972 (Dzahn) updated list: ``` [neodymium:~] $ sudo salt -b 50 -t 20 --output=raw mw* cmd.run 'grep -q sdb /etc/fstab || grep sdb$ /proc/partitions' | grep sdb {'mw2157.codfw.wmnet': ' 8 16 4883865...
[20:31:07] Operations: reinstall rcs100[12] with RAID - https://phabricator.wikimedia.org/T140441#2897037 (Dzahn) p:High>Normal
[20:31:33] Operations: reinstall rcs100[12] with RAID - https://phabricator.wikimedia.org/T140441#2464918 (Dzahn) Looks like rcstream will be replaced by a new service anyways. right
[20:31:52] Operations: reinstall rcs100[12] with RAID - https://phabricator.wikimedia.org/T140441#2897039 (Dzahn) Or should we turn it into "move rcs into ganeti VMs"?
[20:32:27] (PS2) Alex Monk: graphite: Don't use wikitech API to find labs projects/instances [puppet] - https://gerrit.wikimedia.org/r/328608 (https://phabricator.wikimedia.org/T104575)
[20:33:47] Operations: reinstall rdb100[56] with RAID - https://phabricator.wikimedia.org/T140442#2897044 (Dzahn) @elukey i see on rdb1005 for example it has a sda2 entirely used for /tmp, so 20G temp. That reminded me of the 2 videoscalers we reinstalled recently. is this the same partman recipe issue here maybe?
[20:36:01] (PS2) Alex Monk: labstore: Don't use wikitech API to find labs instances in nfs-exportd [puppet] - https://gerrit.wikimedia.org/r/328609 (https://phabricator.wikimedia.org/T104575)
[20:37:38] (PS2) Alex Monk: Move shinkengen from using LDAP to the OpenStack APIs [puppet] - https://gerrit.wikimedia.org/r/328611
[20:39:19] !log catrope@tin Started scap: Sync Idf4618977f172 in the OAuth extension-
[20:39:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:39:44] Operations, Continuous-Integration-Infrastructure, Wikimedia-Apache-configuration: Apache slash expansion should not redirect from HTTPS to HTTP - https://phabricator.wikimedia.org/T95164#2897064 (Dzahn) Nowadays: curl https://integration.wikimedia.org/cover/cdb

The document has moved
(CR) Jcrespo: [C: 2] Disable l10nupdate cron [puppet] - https://gerrit.wikimedia.org/r/328738 (owner: Thcipriani)
[22:11:15] !log disable l10nupdate cron for deployment freeze
[22:11:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:15:51] (PS1) Thcipriani: Revert "Disable l10nupdate cron" [puppet] - https://gerrit.wikimedia.org/r/328839
[22:16:26] (CR) Thcipriani: [C: -1] "Not until after the holidays" [puppet] - https://gerrit.wikimedia.org/r/328839 (owner: Thcipriani)
[22:23:20] RECOVERY - puppet last run on sca1003 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures
[22:28:52] (PS3) Tim Landscheidt: apache: Fix some issues with apache::static_site [puppet] - https://gerrit.wikimedia.org/r/328466 (https://phabricator.wikimedia.org/T153816)
[22:28:54] (PS3) Tim Landscheidt: [WIP] aptly: Make aptly work with Apache [puppet] - https://gerrit.wikimedia.org/r/328467 (https://phabricator.wikimedia.org/T153814)
[22:32:36] (PS1) Filippo Giunchedi: prometheus: extend ops recording rules [puppet] - https://gerrit.wikimedia.org/r/328842
[22:33:08] (CR) jerkins-bot: [V: -1] prometheus: extend ops recording rules [puppet] - https://gerrit.wikimedia.org/r/328842 (owner: Filippo Giunchedi)
[22:36:36] (CR) Filippo Giunchedi: "LGTM, is there a preview of metrics that will be pushed ?" [puppet] - https://gerrit.wikimedia.org/r/327667 (https://phabricator.wikimedia.org/T149210) (owner: Aaron Schulz)
[22:41:27] (CR) Tim Landscheidt: "Tested with (inter alia) "restricted_to => 'a'" and "restricted_to => ['b', 'c']". Technically, the puppetmaster would need to be restart" [puppet] - https://gerrit.wikimedia.org/r/328466 (https://phabricator.wikimedia.org/T153816) (owner: Tim Landscheidt)
[22:46:59] (CR) Jcrespo: [C: -1] "Can I block this temporarily (not based on the patch, but on the fact that there is a potential contention on the heartbeat table:" [puppet] - https://gerrit.wikimedia.org/r/327667 (https://phabricator.wikimedia.org/T149210) (owner: Aaron Schulz)
[22:48:34] Operations: Lost session data on every save attempt - https://phabricator.wikimedia.org/T153984#2897429 (Tacsipacsi) p:Triage>Unbreak!
[23:03:07] Operations: Lost session data on every save attempt - https://phabricator.wikimedia.org/T153984#2897418 (jcrespo) @Tacsipacsi, have you tried logging out, clearing your Wikipedia cookies and logging in again, and seeing if it keeps happening?
[23:07:09] (PS19) Paladox: Gerrit: Enable logstash in gerrit [puppet] - https://gerrit.wikimedia.org/r/326177 (https://phabricator.wikimedia.org/T141324)
[23:14:50] (CR) Filippo Giunchedi: [C: -1] "See comment about from address." (1 comment) [puppet] - https://gerrit.wikimedia.org/r/328673 (https://phabricator.wikimedia.org/T153167) (owner: Gilles)
[23:34:20] PROBLEM - puppet last run on sca1004 is CRITICAL: CRITICAL: Puppet has 27 failures. Last run 2 minutes ago with 27 failures. Failed resources (up to 3 shown): Exec[eth0_v6_token],Package[wipe],Package[zotero/translators],Package[zotero/translation-server]
[23:53:26] (CR) Alex Monk: "caused T153987" [puppet] - https://gerrit.wikimedia.org/r/325949 (owner: Rush)
[23:53:41] (CR) Filippo Giunchedi: [C: -1] "The script itself looks good to me modulo inline comments. I'm not sure about the advantages of base::crond vs puppet's cron resource" (6 comments) [puppet] - https://gerrit.wikimedia.org/r/328660 (https://phabricator.wikimedia.org/T147366) (owner: Mobrovac)
[23:59:17] (CR) Volans: [C: -1] "I don't really see the point of the crond module. Detailed comments inline." (11 comments) [puppet] - https://gerrit.wikimedia.org/r/328660 (https://phabricator.wikimedia.org/T147366) (owner: Mobrovac)