[00:07:57] PROBLEM - Druid historical on druid1006 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args io.druid.cli.Main server historical
[00:07:58] PROBLEM - Check systemd state on druid1006 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[00:36:27] RECOVERY - Druid historical on druid1006 is OK: PROCS OK: 1 process with command name java, args io.druid.cli.Main server historical
[00:36:27] RECOVERY - Check systemd state on druid1006 is OK: OK - running: The system is fully operational
[02:55:58] (PS1) Andrew Bogott: dnsleaks.py: ignore things under .svc.eqiad.wmflabs [puppet] - https://gerrit.wikimedia.org/r/379699
[02:56:50] (CR) Andrew Bogott: [C: 2] dnsleaks.py: ignore things under .svc.eqiad.wmflabs [puppet] - https://gerrit.wikimedia.org/r/379699 (owner: Andrew Bogott)
[03:00:47] PROBLEM - puppet last run on analytics1047 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[03:17:27] (PS1) Andrew Bogott: dnsleaks.py: use case-insensitive comparisons [puppet] - https://gerrit.wikimedia.org/r/379700
[03:18:17] (CR) Andrew Bogott: [C: 2] dnsleaks.py: use case-insensitive comparisons [puppet] - https://gerrit.wikimedia.org/r/379700 (owner: Andrew Bogott)
[03:30:17] RECOVERY - puppet last run on analytics1047 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures
[03:35:21] (CR) BryanDavis: [C: -1] "The existing array syntax is fine, the problem is that the IP address given is incorrect." [mediawiki-config] - https://gerrit.wikimedia.org/r/379661 (https://phabricator.wikimedia.org/T176287) (owner: Zoranzoki21)
[05:48:21] Can anyone unbreak jenkins? https://integration.wikimedia.org/zuul/
[06:03:45] hmm, it looks like gearman is stuck
[06:31:50] Operations, cloud-services-team (Kanban): puppet ca_server confusion - https://phabricator.wikimedia.org/T176437#3626230 (Joe) If you want to better understand what puppet_ca does on an agent, and why removing it afterwards "doesn't break anything" there are good reads in the puppet docs: - https://docs...
[06:32:53] !log installing emacs security updates on trusty (Debian already fixed)
[06:33:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:35:35] Operations, Discovery, Maps-Sprint, Maps (Kartographer), and 2 others: nodejs 6.11 - https://phabricator.wikimedia.org/T170548#3626231 (MoritzMuehlenhoff)
[06:36:10] Operations, Discovery, Maps-Sprint, Maps (Kartographer), and 2 others: nodejs 6.11 - https://phabricator.wikimedia.org/T170548#3434983 (MoritzMuehlenhoff) Open>Resolved >>! In T170548#3625373, @Gehel wrote: > maps is finally upgraded to nodejs 6.11. > > @MoritzMuehlenhoff: according to t...
[06:53:13] (CR) Muehlenhoff: [C: 1] "Looks fine, but is there a reason why this only adds ipv4 addresses?" [puppet] - https://gerrit.wikimedia.org/r/379559 (https://phabricator.wikimedia.org/T176223) (owner: Elukey)
[07:08:53] (PS2) Muehlenhoff: Remove salt minion Icinga check [puppet] - https://gerrit.wikimedia.org/r/379500
[07:12:58] (CR) jerkins-bot: [V: -1] Remove salt minion Icinga check [puppet] - https://gerrit.wikimedia.org/r/379500 (owner: Muehlenhoff)
[07:15:11] (CR) Mobrovac: [C: 1] Configure agent to export Cassandra histogram metrics [puppet] - https://gerrit.wikimedia.org/r/379610 (https://phabricator.wikimedia.org/T171772) (owner: Eevans)
[07:15:13] (Abandoned) Giuseppe Lavagetto: puppet: switch all production hosts to the future parser [puppet] - https://gerrit.wikimedia.org/r/379492 (https://phabricator.wikimedia.org/T171704) (owner: Giuseppe Lavagetto)
[07:20:17] PROBLEM - SSH on copper is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[07:22:42] (CR) Giuseppe Lavagetto: [C: -2] "I don't understand why my previous comments haven't been taken into account at all." [puppet] - https://gerrit.wikimedia.org/r/379560 (https://phabricator.wikimedia.org/T176392) (owner: Paladox)
[07:23:08] RECOVERY - SSH on copper is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u3 (protocol 2.0)
[07:27:57] (CR) Muehlenhoff: [V: 2 C: 2] Remove salt minion Icinga check [puppet] - https://gerrit.wikimedia.org/r/379500 (owner: Muehlenhoff)
[07:32:04] (CR) Legoktm: "recheck" [puppet] - https://gerrit.wikimedia.org/r/379500 (owner: Muehlenhoff)
[07:37:17] Operations, monitoring, Patch-For-Review: add pdu redundancy checking to server/router/switch checks in icinga - https://phabricator.wikimedia.org/T109903#3626264 (faidon) >>! In T109903#3625304, @herron wrote: > Check_ipmi_sensor is showing failures on 3 out of 4 of the Dell PowerEdge R620 class sys...
[07:37:45] (CR) Paladox: "> I don't understand why my previous comments haven't been taken" [puppet] - https://gerrit.wikimedia.org/r/379560 (https://phabricator.wikimedia.org/T176392) (owner: Paladox)
[07:42:20] (CR) Paladox: "I’m not sure how to get this moving along. Maybe we should change the priority of the task to normal unless we can get this change moving " [puppet] - https://gerrit.wikimedia.org/r/379560 (https://phabricator.wikimedia.org/T176392) (owner: Paladox)
[07:47:41] Operations, Phabricator, Release-Engineering-Team, Patch-For-Review: The aphlict systemd unit needs to be rewritten from scratch - https://phabricator.wikimedia.org/T176392#3626266 (Paladox) The patch has stalled and dosent look like it will move along, I guess we should change the priority to no...
[07:54:37] (PS2) Elukey: network::constants: add aqs hosts [puppet] - https://gerrit.wikimedia.org/r/379559 (https://phabricator.wikimedia.org/T176223)
[07:55:50] <_joe_> win 17
[07:55:53] (CR) Elukey: "> Looks fine, but is there a reason why this only adds ipv4" [puppet] - https://gerrit.wikimedia.org/r/379559 (https://phabricator.wikimedia.org/T176223) (owner: Elukey)
[07:57:27] (PS6) Paladox: Phabricator: Fix aphlict systemd script [puppet] - https://gerrit.wikimedia.org/r/379560 (https://phabricator.wikimedia.org/T176392)
[07:58:07] (CR) Paladox: "@Giuseppe Lavagetto I’ve made it forking now." [puppet] - https://gerrit.wikimedia.org/r/379560 (https://phabricator.wikimedia.org/T176392) (owner: Paladox)
[08:26:35] Operations, ops-eqiad, Patch-For-Review, User-Elukey, User-Joe: rack and setup mw1307-1348 - https://phabricator.wikimedia.org/T165519#3626293 (ops-monitoring-bot) Script wmf_auto_reimage was launched by elukey on neodymium.eqiad.wmnet for hosts: ``` mw1322.eqiad.wmnet ``` The log can be foun...
[08:26:37] Operations, ops-eqiad, Patch-For-Review, User-Elukey, User-Joe: rack and setup mw1307-1348 - https://phabricator.wikimedia.org/T165519#3626294 (ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['mw1322.eqiad.wmnet'] ``` Of which those **FAILED**: ``` ['mw1322.eqiad.wmnet'] ```
[08:28:34] Operations, ops-eqiad, Patch-For-Review, User-Elukey, User-Joe: rack and setup mw1307-1348 - https://phabricator.wikimedia.org/T165519#3626295 (ops-monitoring-bot) Script wmf_auto_reimage was launched by elukey on neodymium.eqiad.wmnet for hosts: ``` mw1322.eqiad.wmnet ``` The log can be foun...
[08:29:00] RECOVERY - IPv4 ping to codfw on ripe-atlas-codfw is OK: OK - failed 13 probes of 286 (alerts on 19) - https://atlas.ripe.net/measurements/1791210/#!map
[08:31:20] RECOVERY - IPv4 ping to eqiad on ripe-atlas-eqiad is OK: OK - failed 4 probes of 289 (alerts on 19) - https://atlas.ripe.net/measurements/1790945/#!map
[08:42:41] (CR) Hashar: [C: -1] "I gave it a try on integration-slave-docker-1001 and it fails :(" [puppet] - https://gerrit.wikimedia.org/r/379556 (https://phabricator.wikimedia.org/T176267) (owner: Hashar)
[08:43:04] (CR) Muehlenhoff: [C: 1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/379559 (https://phabricator.wikimedia.org/T176223) (owner: Elukey)
[08:46:34] (CR) Hashar: [C: 1] "That worked just fine on the labs instances :] The Docker hosts on labs are now all on Docker 17.06 \o/" [puppet] - https://gerrit.wikimedia.org/r/379556 (https://phabricator.wikimedia.org/T176267) (owner: Hashar)
[08:47:09] (CR) Elukey: [C: 2] network::constants: add aqs hosts [puppet] - https://gerrit.wikimedia.org/r/379559 (https://phabricator.wikimedia.org/T176223) (owner: Elukey)
[08:52:05] (CR) Hashar: [C: 1] "recheck" [puppet] - https://gerrit.wikimedia.org/r/379556 (https://phabricator.wikimedia.org/T176267) (owner: Hashar)
[08:56:02] (CR) Giuseppe Lavagetto: [C: -1] "A couple more small issues that I'd like to see fixed, apart from that LGTM." (2 comments) [puppet] - https://gerrit.wikimedia.org/r/379560 (https://phabricator.wikimedia.org/T176392) (owner: Paladox)
[09:02:51] (PS2) Muehlenhoff: Remove salt minion packages in production [puppet] - https://gerrit.wikimedia.org/r/379525
[09:03:15] (CR) jerkins-bot: [V: -1] Remove salt minion packages in production [puppet] - https://gerrit.wikimedia.org/r/379525 (owner: Muehlenhoff)
[09:04:47] (PS3) Muehlenhoff: Remove salt minion packages in production [puppet] - https://gerrit.wikimedia.org/r/379525
[09:10:05] (CR) Alexandros Kosiaris: "I don't think this is the right approach. aqs hosts are in no way "special hosts", nor should they be treated that way. A better approach " [puppet] - https://gerrit.wikimedia.org/r/379559 (https://phabricator.wikimedia.org/T176223) (owner: Elukey)
[09:14:10] !log stop mariadb at db1055 for upgrade and maintenance
[09:14:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:22:11] Operations, ops-eqiad, Patch-For-Review, User-Elukey, User-Joe: rack and setup mw1307-1348 - https://phabricator.wikimedia.org/T165519#3626360 (ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['mw1322.eqiad.wmnet'] ``` Of which those **FAILED**: ``` ['mw1322.eqiad.wmnet'] ```
[09:22:41] (CR) Alexandros Kosiaris: [C: -1] contint: docker-ce on labs docker slaves (1 comment) [puppet] - https://gerrit.wikimedia.org/r/379556 (https://phabricator.wikimedia.org/T176267) (owner: Hashar)
[09:25:44] (PS2) Giuseppe Lavagetto: Convert to use of the future parser by default [software/puppet-compiler] - https://gerrit.wikimedia.org/r/379569 (https://phabricator.wikimedia.org/T171704)
[09:34:34] Operations, ops-eqiad, Patch-For-Review, User-Elukey, User-Joe: rack and setup mw1307-1348 - https://phabricator.wikimedia.org/T165519#3626374 (ops-monitoring-bot) Script wmf_auto_reimage was launched by elukey on neodymium.eqiad.wmnet for hosts: ``` mw1321.eqiad.wmnet ``` The log can be foun...
[09:35:57] (PS4) Muehlenhoff: Stop using salt minion in production [puppet] - https://gerrit.wikimedia.org/r/379525
[09:41:42] Operations, Edit-Review-Improvements, Collaboration-Team-Triage (Collab-Team-Q1-Jul-Sep-2017), Performance: Systematically test load speeds of Watchlist and Recent Changes - https://phabricator.wikimedia.org/T176445#3626375 (mark)
[09:42:27] !log mw1319 (new appserver) serving traffic (going to increase its weight up to 20)
[09:42:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:56:58] (PS7) Paladox: Phabricator: Fix aphlict systemd script [puppet] - https://gerrit.wikimedia.org/r/379560 (https://phabricator.wikimedia.org/T176392)
[09:57:01] (CR) Paladox: Phabricator: Fix aphlict systemd script (2 comments) [puppet] - https://gerrit.wikimedia.org/r/379560 (https://phabricator.wikimedia.org/T176392) (owner: Paladox)
[09:57:44] (PS9) Paladox: Phabricator: Fix aphlict systemd script [puppet] - https://gerrit.wikimedia.org/r/379560 (https://phabricator.wikimedia.org/T176392)
[09:59:35] (PS3) Giuseppe Lavagetto: Convert to use of the future parser by default [software/puppet-compiler] - https://gerrit.wikimedia.org/r/379569 (https://phabricator.wikimedia.org/T171704)
[10:03:18] (PS1) Muehlenhoff: Stop including a salt master in the cluster management role [puppet] - https://gerrit.wikimedia.org/r/379712
[10:03:20] (PS1) Muehlenhoff: Remove obsolete role::salt::masters::production class [puppet] - https://gerrit.wikimedia.org/r/379713
[10:03:47] (PS1) Addshore: Add AdvancedSearch to extension-list-labs [mediawiki-config] - https://gerrit.wikimedia.org/r/379714
[10:03:49] (PS1) Addshore: Enable AdvancedSearch on beta [mediawiki-config] - https://gerrit.wikimedia.org/r/379715
[10:04:11] (CR) Addshore: [C: -1] "not yet" [mediawiki-config] - https://gerrit.wikimedia.org/r/379714 (owner: Addshore)
[10:04:15] (CR) Addshore: [C: -1] "not yet" [mediawiki-config] - https://gerrit.wikimedia.org/r/379715 (owner: Addshore)
[10:05:11] (CR) Giuseppe Lavagetto: [C: 2] Convert to use of the future parser by default [software/puppet-compiler] - https://gerrit.wikimedia.org/r/379569 (https://phabricator.wikimedia.org/T171704) (owner: Giuseppe Lavagetto)
[10:11:05] (CR) Ema: [C: 2] bgp: FSM can be in states != ST_IDLE when the connection is closed [debs/pybal] (1.14) - https://gerrit.wikimedia.org/r/379570 (https://phabricator.wikimedia.org/T173028) (owner: Ema)
[10:13:40] PROBLEM - SSH on copper is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[10:13:47] <_joe_> what's up with copper?
[10:14:12] <_joe_> moritzm: are you logged in? I can't login in fact
[10:14:33] https://grafana.wikimedia.org/dashboard/file/server-board.json?var-server=copper&refresh=1m&orgId=1
[10:14:36] it looks a bit overloaded
[10:14:44] _joe_: there's an icinga critical for copper SSH
[10:14:54] <_joe_> ema: that's what I was responding to indeed
[10:15:00] <_joe_> and yes, it seems overloaded
[10:15:05] <_joe_> I was asking myself by what
[10:15:13] oh yeah I see :) I can't ssh currently
[10:15:25] now I'm in
[10:15:26] <_joe_> it already happened at 7 AM utc
[10:15:31] RECOVERY - SSH on copper is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u3 (protocol 2.0)
[10:15:36] <_joe_> I'm what was going on
[10:17:13] don't think it's related to the hhvm build, that was idling due to a build error
[10:32:39] (PS6) ArielGlenn: Move dataset rsync config manifests to dumps module [puppet] - https://gerrit.wikimedia.org/r/379668 (https://phabricator.wikimedia.org/T175528)
[10:33:24] (CR) ArielGlenn: [C: 2] Move dataset rsync config manifests to dumps module [puppet] - https://gerrit.wikimedia.org/r/379668 (https://phabricator.wikimedia.org/T175528) (owner: ArielGlenn)
[10:38:48] (PS1) Giuseppe Lavagetto: puppet-compiler: bump to version 0.3.4 [puppet] - https://gerrit.wikimedia.org/r/379717 (https://phabricator.wikimedia.org/T171704)
[10:39:09] (PS2) Giuseppe Lavagetto: puppet-compiler: bump to version 0.3.4 [puppet] - https://gerrit.wikimedia.org/r/379717 (https://phabricator.wikimedia.org/T171704)
[10:39:21] (CR) Giuseppe Lavagetto: [V: 2 C: 2] puppet-compiler: bump to version 0.3.4 [puppet] - https://gerrit.wikimedia.org/r/379717 (https://phabricator.wikimedia.org/T171704) (owner: Giuseppe Lavagetto)
[10:41:26] (CR) Giuseppe Lavagetto: [C: 2] "PCC looks good https://puppet-compiler.wmflabs.org/compiler03/7987/phab1001.eqiad.wmnet/" [puppet] - https://gerrit.wikimedia.org/r/379560 (https://phabricator.wikimedia.org/T176392) (owner: Paladox)
[10:41:48] (PS10) Giuseppe Lavagetto: Phabricator: Fix aphlict systemd script [puppet] - https://gerrit.wikimedia.org/r/379560 (https://phabricator.wikimedia.org/T176392) (owner: Paladox)
[10:42:17] (PS1) Jcrespo: Repool db1055 with low weight after maintenance [mediawiki-config] - https://gerrit.wikimedia.org/r/379718
[10:43:13] PROBLEM - MariaDB Slave Lag: s5 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 621785.68 seconds
[10:43:50] <_joe_> paladox: applying in a minute to phab1001
[10:44:14] <_joe_> thanks for working on this
[10:44:17] (CR) Jcrespo: [C: -1] "Not until the buffer pool warms up: https://grafana.wikimedia.org/dashboard/db/mysql?panelId=1&fullscreen&orgId=1&var-dc=eqiad%20prometheu" [mediawiki-config] - https://gerrit.wikimedia.org/r/379718 (owner: Jcrespo)
[10:45:25] <_joe_> lol @ puppet
[10:45:55] <_joe_> it's telling me aphlict failed, but it hasn't
[10:46:04] <_joe_> actually, the fix worked very well
[10:46:21] <_joe_> paladox: good job
[10:47:13] PROBLEM - puppet last run on phab1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Service[aphlict]
[10:47:13] PROBLEM - MariaDB Slave Lag: s5 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 621923.97 seconds
[10:48:10] (CR) Hashar: [C: 1] contint: docker-ce on labs docker slaves (1 comment) [puppet] - https://gerrit.wikimedia.org/r/379556 (https://phabricator.wikimedia.org/T176267) (owner: Hashar)
[10:48:17] <_joe_> that failure on phab1001 is bogus
[10:48:54] Operations, Phabricator, Release-Engineering-Team, Patch-For-Review: The aphlict systemd unit needs to be rewritten from scratch - https://phabricator.wikimedia.org/T176392#3626648 (Joe) Thanks to @Paladox work on this, the aphlict service unit now handles correctly the software. I am going to m...
[10:49:13] RECOVERY - puppet last run on phab1001 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures
[10:49:14] Operations, Phabricator, Release-Engineering-Team, Patch-For-Review: The aphlict systemd unit needs to be rewritten from scratch - https://phabricator.wikimedia.org/T176392#3626649 (Joe) Open>Resolved a:Paladox
[10:49:52] PROBLEM - SSH on copper is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[10:50:39] mmh, again
[10:50:52] RECOVERY - SSH on copper is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u3 (protocol 2.0)
[10:52:00] (PS1) Sbisson: RCFilters: cleanup unused variable [mediawiki-config] - https://gerrit.wikimedia.org/r/379719
[10:54:12] (PS1) Elukey: profile::kafka::broker: add the cluster label to the prometheus metrics [puppet] - https://gerrit.wikimedia.org/r/379720 (https://phabricator.wikimedia.org/T175922)
[10:58:15] (CR) Elukey: [C: 2] "https://puppet-compiler.wmflabs.org/compiler02/7989/ looks good!" [puppet] - https://gerrit.wikimedia.org/r/379720 (https://phabricator.wikimedia.org/T175922) (owner: Elukey)
[11:05:39] (CR) Elukey: "Example of kafka metrics in here:" [puppet] - https://gerrit.wikimedia.org/r/377753 (https://phabricator.wikimedia.org/T175922) (owner: Elukey)
[11:05:47] Operations, hardware-requests: New package builder host - https://phabricator.wikimedia.org/T176472#3626672 (MoritzMuehlenhoff)
[11:15:23] PROBLEM - MariaDB Slave Lag: s5 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 622276.83 seconds
[11:33:45] `.
[11:33:47] `.
[11:34:08] heh
[11:40:22] Operations, Traffic, Wikimedia-Shop, HTTPS: store.wikimedia.org HTTPS issues - https://phabricator.wikimedia.org/T128559#3626730 (Jseddon) Hey @BBlack, Been working on this over the last week. The short: We have HSTS but its set to 90 days. Shopify have confirmed that this can be extended in le...
[11:45:36] Puppet, Trebuchet: Trebuchet master should be separate from scap - https://phabricator.wikimedia.org/T96042#3626742 (MoritzMuehlenhoff) Open>declined Trebuchet has been removed.
[11:50:13] PROBLEM - puppet last run on ms-be1023 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[11:50:39] Operations, Phabricator, Release-Engineering-Team, Patch-For-Review: The aphlict systemd unit needs to be rewritten from scratch - https://phabricator.wikimedia.org/T176392#3626762 (Paladox) @Joe thanks :) Yeh we can remove Ubuntu / upstart support.
[12:00:27] (PS2) Hashar: contint: docker-ce on labs docker slaves [puppet] - https://gerrit.wikimedia.org/r/379556 (https://phabricator.wikimedia.org/T176267)
[12:00:29] (PS1) Hashar: Decouple profile::ci::docker and arcanist install [puppet] - https://gerrit.wikimedia.org/r/379726 (https://phabricator.wikimedia.org/T176267)
[12:00:33] (PS1) Hashar: Decouple profile::ci::docker and zuul-cloner install [puppet] - https://gerrit.wikimedia.org/r/379727 (https://phabricator.wikimedia.org/T176267)
[12:00:35] (PS1) Hashar: Decouple profile::ci::docker and worker_localhost [puppet] - https://gerrit.wikimedia.org/r/379728 (https://phabricator.wikimedia.org/T176267)
[12:00:37] (PS1) Hashar: Move jenkins agent username to hiera [puppet] - https://gerrit.wikimedia.org/r/379729
[12:02:37] !log apt-get upgrade on contint1001 / contint2001
[12:02:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:11:26] (PS1) Milimetric: [WIP] Add druid options to AQS config [puppet] - https://gerrit.wikimedia.org/r/379730
[12:17:23] RECOVERY - puppet last run on ms-be1023 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures
[12:19:47] (CR) Jcrespo: [C: 2] Repool db1055 with low weight after maintenance [mediawiki-config] - https://gerrit.wikimedia.org/r/379718 (owner: Jcrespo)
[12:20:12] (PS1) Jcrespo: Revert "mariadb: Depool db1055" [mediawiki-config] - https://gerrit.wikimedia.org/r/379731
[12:23:36] (Merged) jenkins-bot: Repool db1055 with low weight after maintenance [mediawiki-config] - https://gerrit.wikimedia.org/r/379718 (owner: Jcrespo)
[12:24:01] (CR) jenkins-bot: Repool db1055 with low weight after maintenance [mediawiki-config] - https://gerrit.wikimedia.org/r/379718 (owner: Jcrespo)
[12:26:48] Operations, ops-eqiad, Patch-For-Review, User-Elukey, User-Joe: rack and setup mw1307-1348 - https://phabricator.wikimedia.org/T165519#3626861 (ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['mw1321.eqiad.wmnet'] ``` Of which those **FAILED**: ``` ['mw1321.eqiad.wmnet'] ```
[12:30:23] did I miss the log with the deployment log or did it actually show?
[12:31:32] I think some of the bots stopped working, not sure which ones: https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:32:37] (PS1) Elukey: profile::kafka::broker: remove graphite metrics config [puppet] - https://gerrit.wikimedia.org/r/379734 (https://phabricator.wikimedia.org/T175922)
[12:34:50] Operations, ops-eqiad, Patch-For-Review, User-Elukey, User-Joe: rack and setup mw1307-1348 - https://phabricator.wikimedia.org/T165519#3626867 (ops-monitoring-bot) Script wmf_auto_reimage was launched by elukey on neodymium.eqiad.wmnet for hosts: ``` mw1320.eqiad.wmnet ``` The log can be foun...
[12:36:22] (CR) Elukey: [C: 2] "https://puppet-compiler.wmflabs.org/compiler02/7990/" [puppet] - https://gerrit.wikimedia.org/r/379734 (https://phabricator.wikimedia.org/T175922) (owner: Elukey)
[12:42:55] (PS1) Volans: wmf-auto-reimage: bugfix variable reference [puppet] - https://gerrit.wikimedia.org/r/379742
[12:44:42] PROBLEM - puppet last run on mw1162 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[12:45:42] (PS2) Volans: wmf-auto-reimage: bugfix variable reference [puppet] - https://gerrit.wikimedia.org/r/379742
[12:46:58] (CR) Volans: [C: 2] wmf-auto-reimage: bugfix variable reference [puppet] - https://gerrit.wikimedia.org/r/379742 (owner: Volans)
[13:01:51] (PS2) Hashar: Move jenkins agent username to hiera [puppet] - https://gerrit.wikimedia.org/r/379729
[13:06:21] (CR) Hashar: "For production hosts, the puppet compiler seems all happy about it https://puppet-compiler.wmflabs.org/compiler02/7991/ :]" [puppet] - https://gerrit.wikimedia.org/r/379729 (owner: Hashar)
[13:07:51] Operations, Traffic, Wikimedia-Shop, HTTPS: store.wikimedia.org HTTPS issues - https://phabricator.wikimedia.org/T128559#3626970 (BBlack) Thanks for the updates! Even a 90d HSTS without the preload/includeSub flags is better than nothing. If we can get the time extended out to 1y that's even be...
[13:09:40] Operations, Traffic, Wikimedia-Shop, HTTPS: store.wikimedia.org HTTPS issues - https://phabricator.wikimedia.org/T128559#3626971 (BBlack)
[13:13:03] RECOVERY - puppet last run on mw1162 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures
[13:16:38] Operations, Operations-Software-Development, Goal, Patch-For-Review, Technical-Debt: Sunset our use of Salt - https://phabricator.wikimedia.org/T164780#3626988 (MoritzMuehlenhoff)
[13:16:46] (PS1) Jcrespo: Revert "mariadb: Depool db1055" [mediawiki-config] - https://gerrit.wikimedia.org/r/379750
[13:16:55] (CR) jerkins-bot: [V: -1] Revert "mariadb: Depool db1055" [mediawiki-config] - https://gerrit.wikimedia.org/r/379750 (owner: Jcrespo)
[13:17:16] (Abandoned) Jcrespo: Revert "mariadb: Depool db1055" [mediawiki-config] - https://gerrit.wikimedia.org/r/379731 (owner: Jcrespo)
[13:17:42] (Abandoned) Jcrespo: Revert "mariadb: Depool db1055" [mediawiki-config] - https://gerrit.wikimedia.org/r/379750 (owner: Jcrespo)
[13:21:46] Operations, ops-eqiad, DBA, Phabricator: Decommission db1048 (was Move m3 slave to db1059) - https://phabricator.wikimedia.org/T175679#3626995 (jcrespo) @mmodell This is still needed, but this and the next week are going to be problematic. As a heads up, we may need to merge some puppet changes s...
[13:22:52] Operations, ops-eqiad, DBA, Phabricator: Decommission db1048 (was Move m3 slave to db1059) - https://phabricator.wikimedia.org/T175679#3626998 (mmodell) @jcrespo: Thanks, I'll keep an eye out for it.
[13:25:52] (PS4) Zoranzoki21: Fix problem with throttle rule for John Michael Kohler Art Center. [mediawiki-config] - https://gerrit.wikimedia.org/r/379661 (https://phabricator.wikimedia.org/T176287)
[13:27:05] (PS5) Zoranzoki21: Fix problem with throttle rule for John Michael Kohler Art Center. [mediawiki-config] - https://gerrit.wikimedia.org/r/379661 (https://phabricator.wikimedia.org/T176287)
[13:27:11] Operations, Mail: Upgrade mx1001/mx2001 to stretch - https://phabricator.wikimedia.org/T175361#3627013 (MoritzMuehlenhoff) >>! In T175361#3621879, @herron wrote: > # Provision a mx2001 replacement, say mx2002, test it and then cut the public IPs of mx2001 over to mx2002. Potentially rename it back to mx...
[13:30:12] (PS1) Jcrespo: Pool db1101 as recentchanges replica for s2 with low weight [mediawiki-config] - https://gerrit.wikimedia.org/r/379756 (https://phabricator.wikimedia.org/T176311)
[13:32:01] (PS1) Jcrespo: Pool db1055 with full weight, remove main traffic from rc replicas [mediawiki-config] - https://gerrit.wikimedia.org/r/379757
[13:45:20] Operations, Discovery, Maps-Sprint, Maps (Kartographer), and 2 others: nodejs 6.11 - https://phabricator.wikimedia.org/T170548#3627066 (debt)
[13:45:25] Operations, Maps-Sprint, Maps (Kartotherian): Upgrade kartotherian and tilerator to nodejs 6.11 - https://phabricator.wikimedia.org/T171707#3627064 (debt) Open>Resolved Woohoo! 🎉
[13:49:16] Operations, Discovery, Maps, Maps-Sprint, and 2 others: Make maps active / active - https://phabricator.wikimedia.org/T162362#3627067 (debt) Open>Resolved Thanks @BBlack and @Gehel !
[13:49:45] !log mw1321 (new appserver) serving traffic (going to increase its weight up to 20)
[13:49:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:13:46] Operations, cloud-services-team (Kanban): puppet ca_server confusion - https://phabricator.wikimedia.org/T176437#3627211 (Andrew) As far as I can see, the docs only describe setting ca_server once, for agents, in the [main] block. I am missing an explanation of why we would set it twice, and what settin...
[14:15:31] (PS1) Muehlenhoff: Remove role::salt::masters::labs::project_master [puppet] - https://gerrit.wikimedia.org/r/379763
[14:32:08] (CR) Jdlrobson: "VolkerE Yeh this just needs a SWAT. Ping me if you need help learning about how to do that." [mediawiki-config] - https://gerrit.wikimedia.org/r/377406 (https://phabricator.wikimedia.org/T175670) (owner: VolkerE)
[14:42:48] !log updated tor packages to 0.3.1.7 (new stable series)
[14:43:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:43:29] !log uploaded php-luasandbox build for src:php5.5 (required for CI tests on jessie)
[14:43:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:44:46] ^ hashar: that should've been the last one. when you have a patch to switch the tests to apt.wikimedia.org, add me to reviewers and I'll look into merging it
[14:45:16] (CR) Jcrespo: [C: 2] Pool db1101 as recentchanges replica for s2 with low weight [mediawiki-config] - https://gerrit.wikimedia.org/r/379756 (https://phabricator.wikimedia.org/T176311) (owner: Jcrespo)
[14:46:51] (Merged) jenkins-bot: Pool db1101 as recentchanges replica for s2 with low weight [mediawiki-config] - https://gerrit.wikimedia.org/r/379756 (https://phabricator.wikimedia.org/T176311) (owner: Jcrespo)
[14:47:02] (CR) jenkins-bot: Pool db1101 as recentchanges replica for s2 with low weight [mediawiki-config] - https://gerrit.wikimedia.org/r/379756 (https://phabricator.wikimedia.org/T176311) (owner: Jcrespo)
[14:49:56] !log jynus@tin Synchronized wmf-config/db-eqiad.php: repool db1101 with low weight (duration: 00m 47s)
[14:50:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:50:15] Operations, fundraising-tech-ops, netops: move frav1001's to the frack-fundraising VLAN so we can use it for database testing - https://phabricator.wikimedia.org/T176492#3627380 (Jgreen)
[14:50:33] Operations, fundraising-tech-ops, netops: move frav1001's to the frack-fundraising VLAN so we can use it for database testing - https://phabricator.wikimedia.org/T176492#3627397 (Jgreen)
[14:50:35] Operations, fundraising-tech-ops, netops: move frav1001's to the frack-fundraising VLAN so we can use it for database testing - https://phabricator.wikimedia.org/T176492#3627399 (Jgreen) a:Jgreen>None
[14:50:49] Operations, fundraising-tech-ops, netops: move frav1001's to the frack-fundraising VLAN so we can use it for database testing - https://phabricator.wikimedia.org/T176492#3627380 (Jgreen) p:High>Triage
[14:51:59] (PS1) Muehlenhoff: Remove role::salt::masters::labs from labcontrol* hosts [puppet] - https://gerrit.wikimedia.org/r/379770
[15:05:22] Operations, ops-eqiad, Patch-For-Review, User-Elukey, User-Joe: rack and setup mw1307-1348 - https://phabricator.wikimedia.org/T165519#3627421 (ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['mw1320.eqiad.wmnet'] ``` Of which those **FAILED**: ``` ['mw1320.eqiad.wmnet'] ```
[15:14:42] Operations, hardware-requests: New package builder host - https://phabricator.wikimedia.org/T176472#3626672 (RobH) This seems like it doesn't need much space on the disks, the smallest spare eqiad system I have that meets the other requirements (32GB RAM), we have a few options. We have an older spare W...
[15:17:21] Operations, hardware-requests: New package builder host - https://phabricator.wikimedia.org/T176472#3627478 (MoritzMuehlenhoff) WMF4727 sounds like a pretty good fit (if we can swap copper's SSD drives in there (since they currently have SATA)?)
[15:21:42] Operations, Mail: Upgrade mx1001/mx2001 to stretch - https://phabricator.wikimedia.org/T175361#3627483 (akosiaris) >>! In T175361#3621879, @herron wrote: > Looking more closely at how to pull mx2001 out of service for an OS reload it is more complicated than I originally thought. We have ~100 dns zones...
[15:21:53] !log Restarted Jenkins. Out of memory)
[15:22:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:23:46] Operations, hardware-requests: New package builder host - https://phabricator.wikimedia.org/T176472#3627486 (RobH) Copper is a very old R310, which has cabled HDD with LFF bays. The SFF SDDs fit in, since it is a non-hot-swap chassis. If we want to move the old SSDs from copper into the new host, it wi...
[15:27:47] Operations, hardware-requests: New package builder host - https://phabricator.wikimedia.org/T176472#3627493 (MoritzMuehlenhoff) Or maybe let's go ahead with the SATAs as currently used in WMF4727 (which still is a much faster system than copper). Package building isn't the most I/O bound task we're runni...
[15:31:07] PROBLEM - IPv4 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 30 probes of 286 (alerts on 19) - https://atlas.ripe.net/measurements/1791210/#!map
[15:33:50] oh look ripe-atlas-codfw again
[15:36:07] RECOVERY - IPv4 ping to codfw on ripe-atlas-codfw is OK: OK - failed 1 probes of 286 (alerts on 19) - https://atlas.ripe.net/measurements/1791210/#!map
[15:42:00] Operations, hardware-requests: New package builder host - https://phabricator.wikimedia.org/T176472#3627551 (akosiaris) IIRC we opened T130759 because slow IO had indeed cause some minor suffering on our part. If we can avoid migrating back to SATA disks easily I think we should. There's one more option...
[15:42:17] PROBLEM - puppet last run on ganeti1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[15:53:23] Operations, hardware-requests: New package builder host - https://phabricator.wikimedia.org/T176472#3627554 (MoritzMuehlenhoff) My (mild) concern against a Ganeti VM is that some packages might build differently if they detect virtualisation (via systemd-detect-virt or whatever). Not sure if that's an is...
[15:56:29] 10Operations, 10HHVM, 10User-Elukey: Migration of mw* servers to stretch - https://phabricator.wikimedia.org/T174431#3627558 (10elukey) [15:58:07] _joe_ your welcome :). Just got home and saw your irc messages. [16:04:50] 10Operations, 10fundraising-tech-ops, 10netops: move frav1001's to the frack-fundraising VLAN so we can use it for database testing - https://phabricator.wikimedia.org/T176492#3627580 (10Jgreen) This also requires an updating to the firewall policy, I added the new database and generated the new policy. com... [16:10:27] RECOVERY - puppet last run on ganeti1001 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures [16:17:58] (03PS1) 10Jgreen: switch fundraising test box hostname from frav1001 to frdb1003, adjust IP for new subnet [dns] - 10https://gerrit.wikimedia.org/r/379782 (https://phabricator.wikimedia.org/T176492) [16:31:17] (03CR) 10Jgreen: [C: 032] switch fundraising test box hostname from frav1001 to frdb1003, adjust IP for new subnet [dns] - 10https://gerrit.wikimedia.org/r/379782 (https://phabricator.wikimedia.org/T176492) (owner: 10Jgreen) [16:36:42] 10Operations, 10fundraising-tech-ops, 10netops, 10Patch-For-Review: move frav1001's to the frack-fundraising VLAN so we can use it for database testing - https://phabricator.wikimedia.org/T176492#3627694 (10ayounsi) a:03ayounsi Vlan changed on pfw-eqiad (old) Vlan changed on fasw-c-eqiad (new) Security p... [16:39:22] 10Operations, 10Goal, 10Kubernetes: Operations Q1 goal: Streamlined Service Delivery - https://phabricator.wikimedia.org/T170108#3627729 (10akosiaris) [16:39:24] 10Operations, 10Goal, 10Kubernetes: Experiment with ingress solutions (stretch) - https://phabricator.wikimedia.org/T170121#3627726 (10akosiaris) 05Open>03Resolved a:03akosiaris Here's my experimentation results. = Intro = Ingress resources are just a resource declaration in the kubernetes API, they n... 
[16:41:26] (03CR) 10Alexandros Kosiaris: [C: 031] "I 'll merge on Monday" [puppet] - 10https://gerrit.wikimedia.org/r/379556 (https://phabricator.wikimedia.org/T176267) (owner: 10Hashar) [16:45:20] (03PS1) 10Andrew Bogott: labtest: don't override the labtest puppetmaster ca_server [puppet] - 10https://gerrit.wikimedia.org/r/379788 [16:46:08] (03PS2) 10Andrew Bogott: labtest: don't override the labtest puppetmaster ca_server [puppet] - 10https://gerrit.wikimedia.org/r/379788 [16:46:39] (03CR) 10Andrew Bogott: [C: 032] labtest: don't override the labtest puppetmaster ca_server [puppet] - 10https://gerrit.wikimedia.org/r/379788 (owner: 10Andrew Bogott) [16:49:12] (03PS1) 10ArielGlenn: move fetches of various datasets to dump module from datasets module [puppet] - 10https://gerrit.wikimedia.org/r/379790 (https://phabricator.wikimedia.org/T175528) [16:49:36] (03CR) 10jerkins-bot: [V: 04-1] move fetches of various datasets to dump module from datasets module [puppet] - 10https://gerrit.wikimedia.org/r/379790 (https://phabricator.wikimedia.org/T175528) (owner: 10ArielGlenn) [16:50:59] (03CR) 10ArielGlenn: "I was thinking to collect all these from different parts of the dumps module (where these manifests now are) and pass them in at the profi" [puppet] - 10https://gerrit.wikimedia.org/r/379517 (owner: 10Reedy) [16:51:57] 10Operations, 10fundraising-tech-ops, 10netops, 10Patch-For-Review: move frav1001's to the frack-fundraising VLAN so we can use it for database testing - https://phabricator.wikimedia.org/T176492#3627748 (10ayounsi) 05Open>03Resolved new policy file worked fine, committed. Don't forget to update rackta... 
[17:03:05] (03PS1) 10Giuseppe Lavagetto: Add support for opinionated build containers [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/379792 [17:03:09] (03PS1) 10Giuseppe Lavagetto: [WiP] Add runy base image and a fluentd image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/379793 [17:04:56] (03PS2) 10ArielGlenn: move fetches of various datasets to dump module from datasets module [puppet] - 10https://gerrit.wikimedia.org/r/379790 (https://phabricator.wikimedia.org/T175528) [17:05:20] (03CR) 10jerkins-bot: [V: 04-1] move fetches of various datasets to dump module from datasets module [puppet] - 10https://gerrit.wikimedia.org/r/379790 (https://phabricator.wikimedia.org/T175528) (owner: 10ArielGlenn) [17:06:27] (03PS1) 10Andrew Bogott: Revert "labtest: don't override the labtest puppetmaster ca_server" [puppet] - 10https://gerrit.wikimedia.org/r/379795 [17:07:03] (03CR) 10Andrew Bogott: [C: 032] Revert "labtest: don't override the labtest puppetmaster ca_server" [puppet] - 10https://gerrit.wikimedia.org/r/379795 (owner: 10Andrew Bogott) [17:09:14] (03PS3) 10ArielGlenn: move fetches of various datasets to dump module from datasets module [puppet] - 10https://gerrit.wikimedia.org/r/379790 (https://phabricator.wikimedia.org/T175528) [17:12:00] (03CR) 10Dzahn: "aha, so the instance info tells us the creator, Created by" [puppet] - 10https://gerrit.wikimedia.org/r/379499 (owner: 10Elukey) [17:12:48] (03CR) 10Dzahn: [C: 031] "oh, somebody already did :), i would if we get a +1 from leszek then merge it" [puppet] - 10https://gerrit.wikimedia.org/r/379499 (owner: 10Elukey) [17:25:07] 10Operations, 10ops-eqiad: rack/setup/install flerovium.eqiad.wmnet - https://phabricator.wikimedia.org/T176505#3627849 (10RobH) [17:25:09] 10Operations, 10ops-codfw: rack/setup/install furud.codfw.wmnet - https://phabricator.wikimedia.org/T176506#3627866 (10RobH) [17:25:30] 10Operations, 10ops-eqiad: relabel WMF3083 as frdb1003 - 
https://phabricator.wikimedia.org/T176507#3627884 (10Jgreen) [17:32:32] (03PS1) 10BBlack: LVS: turn off ip_early_demux [puppet] - 10https://gerrit.wikimedia.org/r/379798 [17:32:34] (03PS1) 10BBlack: Global: Turn off ethernet flow for all interfaces [puppet] - 10https://gerrit.wikimedia.org/r/379799 [17:32:36] (03PS1) 10BBlack: LVS: Disable LRO [puppet] - 10https://gerrit.wikimedia.org/r/379800 [17:32:38] (03PS1) 10BBlack: Caches: Disable LRO [puppet] - 10https://gerrit.wikimedia.org/r/379801 [17:33:07] (03CR) 10jerkins-bot: [V: 04-1] Global: Turn off ethernet flow for all interfaces [puppet] - 10https://gerrit.wikimedia.org/r/379799 (owner: 10BBlack) [17:33:18] (03CR) 10jerkins-bot: [V: 04-1] LVS: Disable LRO [puppet] - 10https://gerrit.wikimedia.org/r/379800 (owner: 10BBlack) [17:33:36] (03PS1) 10Madhuvishy: public_dumps: Create initial role for public dumps servers [puppet] - 10https://gerrit.wikimedia.org/r/379802 [17:33:41] (03CR) 10jerkins-bot: [V: 04-1] Caches: Disable LRO [puppet] - 10https://gerrit.wikimedia.org/r/379801 (owner: 10BBlack) [17:35:25] (03PS2) 10BBlack: Global: Turn off ethernet flow for all interfaces [puppet] - 10https://gerrit.wikimedia.org/r/379799 [17:35:27] (03PS2) 10BBlack: LVS: Disable LRO [puppet] - 10https://gerrit.wikimedia.org/r/379800 [17:35:29] (03PS2) 10BBlack: Caches: Disable LRO [puppet] - 10https://gerrit.wikimedia.org/r/379801 [17:36:07] (03CR) 10jerkins-bot: [V: 04-1] LVS: Disable LRO [puppet] - 10https://gerrit.wikimedia.org/r/379800 (owner: 10BBlack) [17:36:21] (03CR) 10jerkins-bot: [V: 04-1] Caches: Disable LRO [puppet] - 10https://gerrit.wikimedia.org/r/379801 (owner: 10BBlack) [17:36:49] (03CR) 10Madhuvishy: [C: 032] public_dumps: Create initial role for public dumps servers [puppet] - 10https://gerrit.wikimedia.org/r/379802 (owner: 10Madhuvishy) [17:37:56] (03PS3) 10BBlack: LVS: Disable LRO [puppet] - 10https://gerrit.wikimedia.org/r/379800 [17:37:58] (03PS3) 10BBlack: Caches: Disable LRO [puppet] - 
10https://gerrit.wikimedia.org/r/379801 [17:38:41] (03CR) 10jerkins-bot: [V: 04-1] Caches: Disable LRO [puppet] - 10https://gerrit.wikimedia.org/r/379801 (owner: 10BBlack) [17:42:59] I hate you too jenkins :P [17:43:09] (03PS4) 10BBlack: Caches: Disable LRO [puppet] - 10https://gerrit.wikimedia.org/r/379801 [17:43:44] hrhr [17:43:50] (03CR) 10jerkins-bot: [V: 04-1] Caches: Disable LRO [puppet] - 10https://gerrit.wikimedia.org/r/379801 (owner: 10BBlack) [17:47:18] (03PS4) 10BryanDavis: wmcs: Add wikireplica_dns management script [puppet] - 10https://gerrit.wikimedia.org/r/378739 (https://phabricator.wikimedia.org/T174860) [17:47:55] (03PS1) 10Andrew Bogott: labs puppetmaster: install observerenv [puppet] - 10https://gerrit.wikimedia.org/r/379804 [17:48:20] (03CR) 10BryanDavis: "PS4 adds the tools.db.svc.eqiad.wmflabs service name. This is ready to merge." [puppet] - 10https://gerrit.wikimedia.org/r/378739 (https://phabricator.wikimedia.org/T174860) (owner: 10BryanDavis) [17:48:55] (03PS5) 10BBlack: Caches: Disable LRO [puppet] - 10https://gerrit.wikimedia.org/r/379801 [17:50:36] (03CR) 10Andrew Bogott: [C: 032] labs puppetmaster: install observerenv [puppet] - 10https://gerrit.wikimedia.org/r/379804 (owner: 10Andrew Bogott) [17:54:13] 10Operations, 10Fundraising-Backlog, 10fundraising-tech-ops: Port fundraising stats off Ganglia - https://phabricator.wikimedia.org/T152562#3627931 (10cwdent) [17:54:15] 10Operations, 10fundraising-tech-ops: Long term storage for frack prometheus data - https://phabricator.wikimedia.org/T175738#3627929 (10cwdent) 05Open>03Resolved We will look into aggregated stats again later but there were spare 1TB disks on the lvs servers so I moved the prometheus backend there and set... 
[18:06:06] (03PS4) 10ArielGlenn: move fetches of various datasets to dump module from datasets module [puppet] - 10https://gerrit.wikimedia.org/r/379790 (https://phabricator.wikimedia.org/T175528) [18:11:13] (03Draft1) 10Paladox: Phabricator: Remove ubuntu / upstart support [puppet] - 10https://gerrit.wikimedia.org/r/379794 [18:11:16] (03PS2) 10Paladox: Phabricator: Remove ubuntu / upstart support [puppet] - 10https://gerrit.wikimedia.org/r/379794 [18:11:24] (03PS3) 10Paladox: Phabricator: Remove ubuntu / upstart support [puppet] - 10https://gerrit.wikimedia.org/r/379794 [18:18:01] (03PS1) 10Madhuvishy: public_dumps: Set up initial module and profile, add to role [puppet] - 10https://gerrit.wikimedia.org/r/379810 (https://phabricator.wikimedia.org/T171539) [18:20:48] (03CR) 10Madhuvishy: [C: 032] public_dumps: Set up initial module and profile, add to role [puppet] - 10https://gerrit.wikimedia.org/r/379810 (https://phabricator.wikimedia.org/T171539) (owner: 10Madhuvishy) [18:26:33] !log demon@tin Pruned MediaWiki: 1.30.0-wmf.18 [keeping static files] (duration: 01m 29s) [18:26:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:32:04] (03CR) 10Dzahn: [C: 031] Phabricator: Remove ubuntu / upstart support [puppet] - 10https://gerrit.wikimedia.org/r/379794 (owner: 10Paladox) [18:32:36] PROBLEM - restbase endpoints health on restbase2008 is CRITICAL: /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) is CRITICAL: Test Get summary from storage responds with malformed body (AttributeError: NoneType object has no attribute get) [18:32:46] PROBLEM - Restbase edge esams on text-lb.esams.wikimedia.org is CRITICAL: /api/rest_v1/page/summary/{title} (Get summary from storage) is CRITICAL: Test Get summary from storage responds with malformed body (AttributeError: NoneType object has no attribute get) [18:32:46] PROBLEM - Restbase edge codfw on text-lb.codfw.wikimedia.org is CRITICAL: /api/rest_v1/page/summary/{title} (Get 
summary from storage) is CRITICAL: Test Get summary from storage responds with malformed body (AttributeError: NoneType object has no attribute get) [18:32:47] PROBLEM - restbase endpoints health on restbase1007 is CRITICAL: /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) is CRITICAL: Test Get summary from storage responds with malformed body (AttributeError: NoneType object has no attribute get) [18:32:58] PROBLEM - restbase endpoints health on restbase1017 is CRITICAL: /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) is CRITICAL: Test Get summary from storage responds with malformed body (AttributeError: NoneType object has no attribute get) [18:32:58] PROBLEM - restbase endpoints health on restbase2011 is CRITICAL: /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) is CRITICAL: Test Get summary from storage responds with malformed body (AttributeError: NoneType object has no attribute get) [18:32:58] PROBLEM - restbase endpoints health on restbase2010 is CRITICAL: /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) is CRITICAL: Test Get summary from storage responds with malformed body (AttributeError: NoneType object has no attribute get) [18:32:58] PROBLEM - restbase endpoints health on restbase2012 is CRITICAL: /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) is CRITICAL: Test Get summary from storage responds with malformed body (AttributeError: NoneType object has no attribute get) [18:32:58] PROBLEM - restbase endpoints health on restbase1013 is CRITICAL: /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) is CRITICAL: Test Get summary from storage responds with malformed body (AttributeError: NoneType object has no attribute get) [18:33:06] PROBLEM - restbase endpoints health on restbase1015 is CRITICAL: /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) is CRITICAL: Test Get summary from storage responds with malformed body 
(AttributeError: NoneType object has no attribute get) [18:33:06] PROBLEM - Restbase LVS codfw on restbase.svc.codfw.wmnet is CRITICAL: /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) is CRITICAL: Test Get summary from storage responds with malformed body (AttributeError: NoneType object has no attribute get) [18:33:06] PROBLEM - restbase endpoints health on restbase2009 is CRITICAL: /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) is CRITICAL: Test Get summary from storage responds with malformed body (AttributeError: NoneType object has no attribute get) [18:33:06] PROBLEM - restbase endpoints health on restbase2004 is CRITICAL: /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) is CRITICAL: Test Get summary from storage responds with malformed body (AttributeError: NoneType object has no attribute get) [18:33:07] PROBLEM - restbase endpoints health on restbase1016 is CRITICAL: /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) is CRITICAL: Test Get summary from storage responds with malformed body (AttributeError: NoneType object has no attribute get) [18:33:07] PROBLEM - restbase endpoints health on restbase2002 is CRITICAL: /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) is CRITICAL: Test Get summary from storage responds with malformed body (AttributeError: NoneType object has no attribute get) [18:33:07] PROBLEM - restbase endpoints health on restbase2006 is CRITICAL: /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) is CRITICAL: Test Get summary from storage responds with malformed body (AttributeError: NoneType object has no attribute get) [18:33:08] PROBLEM - restbase endpoints health on restbase2007 is CRITICAL: /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) is CRITICAL: Test Get summary from storage responds with malformed body (AttributeError: NoneType object has no attribute get) [18:33:16] PROBLEM - Restbase LVS 
eqiad on restbase.svc.eqiad.wmnet is CRITICAL: /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) is CRITICAL: Test Get summary from storage responds with malformed body (AttributeError: NoneType object has no attribute get) [18:33:16] PROBLEM - restbase endpoints health on restbase1018 is CRITICAL: /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) is CRITICAL: Test Get summary from storage responds with malformed body (AttributeError: NoneType object has no attribute get) [18:33:16] PROBLEM - restbase endpoints health on restbase1011 is CRITICAL: /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) is CRITICAL: Test Get summary from storage responds with malformed body (AttributeError: NoneType object has no attribute get) [18:33:16] PROBLEM - restbase endpoints health on restbase1012 is CRITICAL: /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) is CRITICAL: Test Get summary from storage responds with malformed body (AttributeError: NoneType object has no attribute get) [18:33:17] PROBLEM - restbase endpoints health on restbase1014 is CRITICAL: /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) is CRITICAL: Test Get summary from storage responds with malformed body (AttributeError: NoneType object has no attribute get) [18:33:26] PROBLEM - Restbase edge ulsfo on text-lb.ulsfo.wikimedia.org is CRITICAL: /api/rest_v1/page/summary/{title} (Get summary from storage) is CRITICAL: Test Get summary from storage responds with malformed body (AttributeError: NoneType object has no attribute get) [18:33:27] PROBLEM - Restbase edge eqiad on text-lb.eqiad.wikimedia.org is CRITICAL: /api/rest_v1/page/summary/{title} (Get summary from storage) is CRITICAL: Test Get summary from storage responds with malformed body (AttributeError: NoneType object has no attribute get) [18:33:39] uh [18:35:37] urandom: [18:35:39] eh [18:37:36] PROBLEM - puppet last run on cp3031 is CRITICAL: CRITICAL: 
Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:41:21] 10Operations, 10Edit-Review-Improvements, 10Collaboration-Team-Triage (Collab-Team-Q1-Jul-Sep-2017), 10Performance: Systematically test load speeds of Watchlist and Recent Changes - https://phabricator.wikimedia.org/T176445#3628015 (10jmatazzoni) [18:47:36] PROBLEM - Host mr1-eqiad.oob IPv6 is DOWN: CRITICAL - Destination Unreachable (2607:f6f0:205::153) [18:47:46] PROBLEM - Host mr1-eqiad.oob is DOWN: PING CRITICAL - Packet loss = 100% [18:51:27] 10Operations, 10Analytics, 10monitoring, 10Patch-For-Review: Eventstreams graphite disk usage - https://phabricator.wikimedia.org/T160644#3628043 (10Nuria) [18:52:46] RECOVERY - Host mr1-eqiad.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 7.07 ms [18:52:56] RECOVERY - Host mr1-eqiad.oob is UP: PING OK - Packet loss = 0%, RTA = 37.82 ms [19:05:56] RECOVERY - puppet last run on cp3031 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures [19:13:52] (03PS1) 10Krinkle: webperf: Fix crash when event contains browser_major:null [puppet] - 10https://gerrit.wikimedia.org/r/379820 (https://phabricator.wikimedia.org/T176149) [19:15:39] (03CR) 10Krinkle: "Test fails as expected:" [puppet] - 10https://gerrit.wikimedia.org/r/379820 (https://phabricator.wikimedia.org/T176149) (owner: 10Krinkle) [19:36:53] (03CR) 1020after4: [C: 031] phragile: disallow .htaccess usage [puppet] - 10https://gerrit.wikimedia.org/r/379499 (owner: 10Elukey) [19:39:40] (03CR) 1020after4: [C: 031] Phabricator: Remove ubuntu / upstart support [puppet] - 10https://gerrit.wikimedia.org/r/379794 (owner: 10Paladox) [19:41:16] PROBLEM - puppet last run on mw1296 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [19:43:25] (03CR) 10Dzahn: [C: 031] Gerrit: Enable ui for slaves [puppet] - 10https://gerrit.wikimedia.org/r/379420 (owner: 10Paladox) [19:46:45] (03PS4) 1020after4: Phabricator: Remove ubuntu / upstart support [puppet] - 10https://gerrit.wikimedia.org/r/379794 (owner: 10Paladox) [19:47:09] (03CR) 10jerkins-bot: [V: 04-1] Phabricator: Remove ubuntu / upstart support [puppet] - 10https://gerrit.wikimedia.org/r/379794 (owner: 10Paladox) [19:48:10] (03PS5) 1020after4: Phabricator: Remove ubuntu / upstart support [puppet] - 10https://gerrit.wikimedia.org/r/379794 (owner: 10Paladox) [19:49:33] (03CR) 1020after4: [C: 031] Phabricator: Remove ubuntu / upstart support [puppet] - 10https://gerrit.wikimedia.org/r/379794 (owner: 10Paladox) [19:54:50] (03PS1) 10Bmansurov: Implement Schema:Print purging strategy [puppet] - 10https://gerrit.wikimedia.org/r/379829 (https://phabricator.wikimedia.org/T175395) [19:55:13] (03PS5) 10Krinkle: webperf: Limit by-country navtiming breakdown to those with 5+ hits/min [puppet] - 10https://gerrit.wikimedia.org/r/377806 (https://phabricator.wikimedia.org/T166390) [19:55:15] (03PS2) 10Krinkle: webperf: Fix crash when event contains browser_major:null [puppet] - 10https://gerrit.wikimedia.org/r/379820 (https://phabricator.wikimedia.org/T176149) [19:55:17] (03PS1) 10Krinkle: [WIP] webperf: Add navtiming tests to puppet.git:/tox.ini [puppet] - 10https://gerrit.wikimedia.org/r/379830 [19:57:39] (03PS17) 10MarcoAurelio: Initial configuration for hi.wikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/371109 (https://phabricator.wikimedia.org/T173013) [19:57:41] ACKNOWLEDGEMENT - Restbase LVS codfw on restbase.svc.codfw.wmnet is CRITICAL: /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) is CRITICAL: Test Get summary from storage responds with malformed body (AttributeError: NoneType object has no attribute get): gwicke There is a bug here that dropped optional 
fields. However, severity is low (missing optional thumb), so no need to stress over the weekend. [19:57:41] ACKNOWLEDGEMENT - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is CRITICAL: /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) is CRITICAL: Test Get summary from storage responds with malformed body (AttributeError: NoneType object has no attribute get): gwicke There is a bug here that dropped optional fields. However, severity is low (missing optional thumb), so no need to stress over the weekend. [19:57:41] ACKNOWLEDGEMENT - restbase endpoints health on restbase1007 is CRITICAL: /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) is CRITICAL: Test Get summary from storage responds with malformed body (AttributeError: NoneType object has no attribute get): gwicke There is a bug here that dropped optional fields. However, severity is low (missing optional thumb), so no need to stress over the weekend. [19:57:41] ACKNOWLEDGEMENT - restbase endpoints health on restbase1011 is CRITICAL: /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) is CRITICAL: Test Get summary from storage responds with malformed body (AttributeError: NoneType object has no attribute get): gwicke There is a bug here that dropped optional fields. However, severity is low (missing optional thumb), so no need to stress over the weekend. [19:57:41] ACKNOWLEDGEMENT - restbase endpoints health on restbase1012 is CRITICAL: /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) is CRITICAL: Test Get summary from storage responds with malformed body (AttributeError: NoneType object has no attribute get): gwicke There is a bug here that dropped optional fields. However, severity is low (missing optional thumb), so no need to stress over the weekend. 
[19:57:41] ACKNOWLEDGEMENT - restbase endpoints health on restbase1013 is CRITICAL: /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) is CRITICAL: Test Get summary from storage responds with malformed body (AttributeError: NoneType object has no attribute get): gwicke There is a bug here that dropped optional fields. However, severity is low (missing optional thumb), so no need to stress over the weekend. [19:57:42] ACKNOWLEDGEMENT - restbase endpoints health on restbase1014 is CRITICAL: /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) is CRITICAL: Test Get summary from storage responds with malformed body (AttributeError: NoneType object has no attribute get): gwicke There is a bug here that dropped optional fields. However, severity is low (missing optional thumb), so no need to stress over the weekend. [19:57:42] ACKNOWLEDGEMENT - restbase endpoints health on restbase1015 is CRITICAL: /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) is CRITICAL: Test Get summary from storage responds with malformed body (AttributeError: NoneType object has no attribute get): gwicke There is a bug here that dropped optional fields. However, severity is low (missing optional thumb), so no need to stress over the weekend. [19:57:43] ACKNOWLEDGEMENT - restbase endpoints health on restbase1016 is CRITICAL: /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) is CRITICAL: Test Get summary from storage responds with malformed body (AttributeError: NoneType object has no attribute get): gwicke There is a bug here that dropped optional fields. However, severity is low (missing optional thumb), so no need to stress over the weekend. 
[19:57:43] ACKNOWLEDGEMENT - restbase endpoints health on restbase1017 is CRITICAL: /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) is CRITICAL: Test Get summary from storage responds with malformed body (AttributeError: NoneType object has no attribute get): gwicke There is a bug here that dropped optional fields. However, severity is low (missing optional thumb), so no need to stress over the weekend. [19:57:44] ACKNOWLEDGEMENT - restbase endpoints health on restbase1018 is CRITICAL: /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) is CRITICAL: Test Get summary from storage responds with malformed body (AttributeError: NoneType object has no attribute get): gwicke There is a bug here that dropped optional fields. However, severity is low (missing optional thumb), so no need to stress over the weekend. [19:57:44] ACKNOWLEDGEMENT - restbase endpoints health on restbase2002 is CRITICAL: /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) is CRITICAL: Test Get summary from storage responds with malformed body (AttributeError: NoneType object has no attribute get): gwicke There is a bug here that dropped optional fields. However, severity is low (missing optional thumb), so no need to stress over the weekend. [19:57:45] ACKNOWLEDGEMENT - restbase endpoints health on restbase2004 is CRITICAL: /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) is CRITICAL: Test Get summary from storage responds with malformed body (AttributeError: NoneType object has no attribute get): gwicke There is a bug here that dropped optional fields. However, severity is low (missing optional thumb), so no need to stress over the weekend. 
[19:57:45] ACKNOWLEDGEMENT - restbase endpoints health on restbase2006 is CRITICAL: /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) is CRITICAL: Test Get summary from storage responds with malformed body (AttributeError: NoneType object has no attribute get): gwicke There is a bug here that dropped optional fields. However, severity is low (missing optional thumb), so no need to stress over the weekend. [19:57:46] ACKNOWLEDGEMENT - restbase endpoints health on restbase2007 is CRITICAL: /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) is CRITICAL: Test Get summary from storage responds with malformed body (AttributeError: NoneType object has no attribute get): gwicke There is a bug here that dropped optional fields. However, severity is low (missing optional thumb), so no need to stress over the weekend. [19:57:46] ACKNOWLEDGEMENT - restbase endpoints health on restbase2008 is CRITICAL: /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) is CRITICAL: Test Get summary from storage responds with malformed body (AttributeError: NoneType object has no attribute get): gwicke There is a bug here that dropped optional fields. However, severity is low (missing optional thumb), so no need to stress over the weekend. [19:57:47] ACKNOWLEDGEMENT - restbase endpoints health on restbase2009 is CRITICAL: /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) is CRITICAL: Test Get summary from storage responds with malformed body (AttributeError: NoneType object has no attribute get): gwicke There is a bug here that dropped optional fields. However, severity is low (missing optional thumb), so no need to stress over the weekend. 
[19:57:47] ACKNOWLEDGEMENT - restbase endpoints health on restbase2010 is CRITICAL: /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) is CRITICAL: Test Get summary from storage responds with malformed body (AttributeError: NoneType object has no attribute get): gwicke There is a bug here that dropped optional fields. However, severity is low (missing optional thumb), so no need to stress over the weekend. [19:57:48] ACKNOWLEDGEMENT - restbase endpoints health on restbase2011 is CRITICAL: /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) is CRITICAL: Test Get summary from storage responds with malformed body (AttributeError: NoneType object has no attribute get): gwicke There is a bug here that dropped optional fields. However, severity is low (missing optional thumb), so no need to stress over the weekend. [19:57:48] ACKNOWLEDGEMENT - restbase endpoints health on restbase2012 is CRITICAL: /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) is CRITICAL: Test Get summary from storage responds with malformed body (AttributeError: NoneType object has no attribute get): gwicke There is a bug here that dropped optional fields. However, severity is low (missing optional thumb), so no need to stress over the weekend. [19:58:35] (03PS5) 10Andrew Bogott: WIP: nova: turn off hourly instance usage audits [puppet] - 10https://gerrit.wikimedia.org/r/377187 [19:58:55] ACKNOWLEDGEMENT - Restbase edge codfw on text-lb.codfw.wikimedia.org is CRITICAL: /api/rest_v1/page/summary/{title} (Get summary from storage) is CRITICAL: Test Get summary from storage responds with malformed body (AttributeError: NoneType object has no attribute get): gwicke Non-critical bug that dropped optional response properties. 
[19:58:55] ACKNOWLEDGEMENT - Restbase edge eqiad on text-lb.eqiad.wikimedia.org is CRITICAL: /api/rest_v1/page/summary/{title} (Get summary from storage) is CRITICAL: Test Get summary from storage responds with malformed body (AttributeError: NoneType object has no attribute get): gwicke Non-critical bug that dropped optional response properties. [19:58:55] ACKNOWLEDGEMENT - Restbase edge esams on text-lb.esams.wikimedia.org is CRITICAL: /api/rest_v1/page/summary/{title} (Get summary from storage) is CRITICAL: Test Get summary from storage responds with malformed body (AttributeError: NoneType object has no attribute get): gwicke Non-critical bug that dropped optional response properties. [19:58:55] ACKNOWLEDGEMENT - Restbase edge ulsfo on text-lb.ulsfo.wikimedia.org is CRITICAL: /api/rest_v1/page/summary/{title} (Get summary from storage) is CRITICAL: Test Get summary from storage responds with malformed body (AttributeError: NoneType object has no attribute get): gwicke Non-critical bug that dropped optional response properties. [20:02:48] (03PS2) 10Bmansurov: Implement Schema:Print purging strategy [puppet] - 10https://gerrit.wikimedia.org/r/379829 (https://phabricator.wikimedia.org/T175395) [20:05:06] (03PS2) 10Krinkle: [WIP] webperf: Add navtiming tests to puppet.git:/tox.ini [puppet] - 10https://gerrit.wikimedia.org/r/379830 [20:05:38] (03CR) 10jerkins-bot: [V: 04-1] [WIP] webperf: Add navtiming tests to puppet.git:/tox.ini [puppet] - 10https://gerrit.wikimedia.org/r/379830 (owner: 10Krinkle) [20:05:42] 10Operations, 10Traffic, 10Wikidata, 10wikiba.se, 10Wikidata-Sprint-2016-11-08: [Task] move wikiba.se webhosting to wikimedia misc-cluster - https://phabricator.wikimedia.org/T99531#3628244 (10Dzahn) Hey @Lydia_Pintscher Happy to work on this and talk to you maybe on IRC as well. I would say one of the... 
[20:06:11] (03PS3) 10Krinkle: [WIP] webperf: Add navtiming tests to puppet.git:/tox.ini [puppet] - 10https://gerrit.wikimedia.org/r/379830 [20:08:33] (03PS4) 10Krinkle: [WIP] webperf: Add navtiming tests to puppet.git:/tox.ini [puppet] - 10https://gerrit.wikimedia.org/r/379830 [20:10:52] (03CR) 10MarcoAurelio: "Wouldn't it be simpler if all changes to operations/mediawiki-config be in a single patch? Not wishing to put myself as an example but it " [mediawiki-config] - 10https://gerrit.wikimedia.org/r/378401 (https://phabricator.wikimedia.org/T176042) (owner: 10Ladsgroup) [20:11:30] (03PS5) 10Krinkle: [WIP] webperf: Add navtiming tests to puppet.git:/tox.ini [puppet] - 10https://gerrit.wikimedia.org/r/379830 [20:11:32] (03PS3) 10Krinkle: webperf: Fix crash when event contains browser_major:null [puppet] - 10https://gerrit.wikimedia.org/r/379820 (https://phabricator.wikimedia.org/T176149) [20:12:13] (03CR) 10Chad: "What Marco said: let's please deploy new wikis with as much initial configuration as possible." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/378401 (https://phabricator.wikimedia.org/T176042) (owner: 10Ladsgroup) [20:12:15] (03CR) 10Ladsgroup: "meh, I love small patches." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/378401 (https://phabricator.wikimedia.org/T176042) (owner: 10Ladsgroup) [20:12:26] RECOVERY - puppet last run on mw1296 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures [20:13:54] (03CR) 10Ladsgroup: "okay, I fix it." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/378401 (https://phabricator.wikimedia.org/T176042) (owner: 10Ladsgroup) [20:18:53] (03CR) 10MarcoAurelio: Add config for amwikimedia (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/378400 (https://phabricator.wikimedia.org/T176042) (owner: 10Ladsgroup) [20:21:36] (03PS6) 10Krinkle: [WIP] webperf: Add navtiming tests to puppet.git:/tox.ini [puppet] - 10https://gerrit.wikimedia.org/r/379830 [20:21:38] (03PS4) 10Krinkle: webperf: Fix crash when event contains browser_major:null [puppet] - 10https://gerrit.wikimedia.org/r/379820 (https://phabricator.wikimedia.org/T176149) [20:22:24] (03CR) 10jerkins-bot: [V: 04-1] webperf: Fix crash when event contains browser_major:null [puppet] - 10https://gerrit.wikimedia.org/r/379820 (https://phabricator.wikimedia.org/T176149) (owner: 10Krinkle) [20:25:11] (03PS7) 10Krinkle: [WIP] webperf: Add navtiming tests to puppet.git:/tox.ini [puppet] - 10https://gerrit.wikimedia.org/r/379830 [20:25:13] (03PS5) 10Krinkle: webperf: Fix crash when event contains browser_major:null [puppet] - 10https://gerrit.wikimedia.org/r/379820 (https://phabricator.wikimedia.org/T176149) [20:26:05] (03CR) 10jerkins-bot: [V: 04-1] webperf: Fix crash when event contains browser_major:null [puppet] - 10https://gerrit.wikimedia.org/r/379820 (https://phabricator.wikimedia.org/T176149) (owner: 10Krinkle) [20:28:49] 10Operations, 10Edit-Review-Improvements, 10Collaboration-Team-Triage (Collab-Team-Q1-Jul-Sep-2017), 10Performance: Systematically test load speeds of Watchlist and Recent Changes - https://phabricator.wikimedia.org/T176445#3628281 (10jmatazzoni) [20:30:07] RECOVERY - restbase endpoints health on restbase2002 is OK: All endpoints are healthy [20:30:07] RECOVERY - restbase endpoints health on restbase2007 is OK: All endpoints are healthy [20:30:16] RECOVERY - restbase endpoints health on restbase1007 is OK: All endpoints are healthy [20:30:16] RECOVERY - restbase endpoints health 
on restbase1017 is OK: All endpoints are healthy [20:30:17] RECOVERY - restbase endpoints health on restbase2008 is OK: All endpoints are healthy [20:30:26] RECOVERY - Restbase edge ulsfo on text-lb.ulsfo.wikimedia.org is OK: All endpoints are healthy [20:30:27] RECOVERY - restbase endpoints health on restbase2011 is OK: All endpoints are healthy [20:30:27] RECOVERY - restbase endpoints health on restbase1016 is OK: All endpoints are healthy [20:30:36] RECOVERY - Restbase edge codfw on text-lb.codfw.wikimedia.org is OK: All endpoints are healthy [20:30:36] RECOVERY - restbase endpoints health on restbase1013 is OK: All endpoints are healthy [20:30:36] RECOVERY - Restbase edge esams on text-lb.esams.wikimedia.org is OK: All endpoints are healthy [20:30:37] RECOVERY - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is OK: All endpoints are healthy [20:30:37] RECOVERY - restbase endpoints health on restbase1018 is OK: All endpoints are healthy [20:30:37] RECOVERY - restbase endpoints health on restbase2010 is OK: All endpoints are healthy [20:30:37] RECOVERY - restbase endpoints health on restbase2012 is OK: All endpoints are healthy [20:30:46] RECOVERY - restbase endpoints health on restbase1011 is OK: All endpoints are healthy [20:30:47] RECOVERY - restbase endpoints health on restbase1012 is OK: All endpoints are healthy [20:30:47] RECOVERY - restbase endpoints health on restbase1015 is OK: All endpoints are healthy [20:30:47] RECOVERY - restbase endpoints health on restbase1014 is OK: All endpoints are healthy [20:30:56] RECOVERY - restbase endpoints health on restbase2009 is OK: All endpoints are healthy [20:30:57] RECOVERY - Restbase LVS codfw on restbase.svc.codfw.wmnet is OK: All endpoints are healthy [20:31:06] RECOVERY - Restbase edge eqiad on text-lb.eqiad.wikimedia.org is OK: All endpoints are healthy [20:31:06] RECOVERY - restbase endpoints health on restbase2004 is OK: All endpoints are healthy [20:31:06] RECOVERY - restbase endpoints health on 
restbase2006 is OK: All endpoints are healthy [20:32:17] 10Operations, 10Edit-Review-Improvements, 10Collaboration-Team-Triage (Collab-Team-Q1-Jul-Sep-2017), 10Performance: Systematically test load speeds of Watchlist and Recent Changes - https://phabricator.wikimedia.org/T176445#3628285 (10jmatazzoni) [20:39:39] 10Operations, 10Ops-Access-Requests: Requesting access to stat1005 for Slaporte - https://phabricator.wikimedia.org/T176518#3628320 (10Slaporte) [20:40:02] 10Operations, 10Ops-Access-Requests: Requesting access to stat1005 for Slaporte - https://phabricator.wikimedia.org/T176518#3628337 (10Slaporte) [20:42:12] 10Operations, 10Ops-Access-Requests: Requesting access to stat1005 for Slaporte - https://phabricator.wikimedia.org/T176518#3628343 (10ZhouZ) I can confirm Stephen is taking over this. [20:43:17] (03PS8) 10Krinkle: [WIP] webperf: Add navtiming tests to puppet.git:/tox.ini [puppet] - 10https://gerrit.wikimedia.org/r/379830 [20:43:19] (03PS6) 10Krinkle: webperf: Fix crash when event contains browser_major:null [puppet] - 10https://gerrit.wikimedia.org/r/379820 (https://phabricator.wikimedia.org/T176149) [20:44:10] (03CR) 10jerkins-bot: [V: 04-1] webperf: Fix crash when event contains browser_major:null [puppet] - 10https://gerrit.wikimedia.org/r/379820 (https://phabricator.wikimedia.org/T176149) (owner: 10Krinkle) [20:47:35] 10Operations, 10Ops-Access-Requests: Requesting access to stat1005 for Slaporte - https://phabricator.wikimedia.org/T176518#3628347 (10Zoranzoki21) a:03Zoranzoki21 [20:50:08] hmm... don't we need a C-level stuff to take over on that? 
^ [20:50:17] s/stuff/staff [20:53:46] (03PS4) 10Hashar: contint: php5.5 on permanent slaves [puppet] - 10https://gerrit.wikimedia.org/r/377529 [20:55:00] (03PS5) 10Hashar: contint: php5.5 on permanent slaves [puppet] - 10https://gerrit.wikimedia.org/r/377529 (https://phabricator.wikimedia.org/T174972) [20:58:12] (03Draft2) 10Zoranzoki21: Access for Slaporte (Stephen LaPorte) to stat1005 [puppet] - 10https://gerrit.wikimedia.org/r/379851 (https://phabricator.wikimedia.org/T176518) [21:00:11] (03CR) 10Zoranzoki21: "I do not know what to add in uid.. I added uid: 176518" [puppet] - 10https://gerrit.wikimedia.org/r/379851 (https://phabricator.wikimedia.org/T176518) (owner: 10Zoranzoki21) [21:01:01] (03CR) 10Hashar: [C: 031] "Prior patchsets were using an aptly repo on labs. Now that the package are uploaded on apt.wikimedia.org in jessie-wikimedia/component/ci " [puppet] - 10https://gerrit.wikimedia.org/r/377529 (https://phabricator.wikimedia.org/T174972) (owner: 10Hashar) [21:04:15] tabbycat: can you -2 the patchset, ops haven't reviewed for start [21:04:32] UID looks wrong [21:04:51] p858snake: I can't -2 [21:05:20] I'm a simple user :) [21:07:36] PROBLEM - Work requests waiting in Zuul Gearman server https://grafana.wikimedia.org/dashboard/db/zuul-gearman on contint1001 is CRITICAL: CRITICAL: 57.14% of data above the critical threshold [140.0] [21:12:29] (03CR) 10Hashar: [C: 04-1] Access for Slaporte (Stephen LaPorte) to stat1005 (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/379851 (https://phabricator.wikimedia.org/T176518) (owner: 10Zoranzoki21) [21:12:56] p858snake: tabbycat: yeah the uid is random :) [21:13:06] I added some basic comments on the patchset [21:14:23] 10Operations, 10Ops-Access-Requests: Requesting access to stat1005 for Slaporte - https://phabricator.wikimedia.org/T176518#3628320 (10hashar) Note the public key is used on labs and IIRC access to production requires a different ssh key. 
@Zoranzoki21 provided a patch in Gerrit at https://gerrit.wikimedia.org... [21:15:04] (03PS4) 10Halfak: Adds myspell-lv package to ores::base Switches myspell-uk to aspell-uk (better package) Bug: T175628 Bug: T175627 Change-Id: I6d917712a44f404a9a2737c4c58df12f4ee15547 [puppet] - 10https://gerrit.wikimedia.org/r/377327 (https://phabricator.wikimedia.org/T175628) [21:15:33] (03CR) 10jerkins-bot: [V: 04-1] Adds myspell-lv package to ores::base Switches myspell-uk to aspell-uk (better package) Bug: T175628 Bug: T175627 Change-Id: I6d917712a44f404a9a2737c4c58df12f4ee15547 [puppet] - 10https://gerrit.wikimedia.org/r/377327 (https://phabricator.wikimedia.org/T175628) (owner: 10Halfak) [21:15:53] (03PS5) 10Halfak: Adds myspell-lv package to ores::base Switches myspell-uk to aspell-uk (better package) Bug: T175628 Bug: T175627 Change-Id: I6d917712a44f404a9a2737c4c58df12f4ee15547 [puppet] - 10https://gerrit.wikimedia.org/r/377327 (https://phabricator.wikimedia.org/T175628) [21:16:20] (03CR) 10jerkins-bot: [V: 04-1] Adds myspell-lv package to ores::base Switches myspell-uk to aspell-uk (better package) Bug: T175628 Bug: T175627 Change-Id: I6d917712a44f404a9a2737c4c58df12f4ee15547 [puppet] - 10https://gerrit.wikimedia.org/r/377327 (https://phabricator.wikimedia.org/T175628) (owner: 10Halfak) [21:17:20] ACKNOWLEDGEMENT - Work requests waiting in Zuul Gearman server https://grafana.wikimedia.org/dashboard/db/zuul-gearman on contint1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [140.0] amusso Transient mass changes on mediawiki/core [21:18:20] (03CR) 10Hashar: Adds myspell-lv package to ores::base Switches myspell-uk to aspell-uk (better package) Bug: T175628 Bug: T175627 Change-Id: I6d917712a44f40 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/377327 (https://phabricator.wikimedia.org/T175628) (owner: 10Halfak) [21:18:38] halfak: hi you are missing a newline in the commit message :] [21:19:17] hashar, where? between the message and "Bug: "? 
[21:19:32] (03PS3) 10Zoranzoki21: Access for Slaporte (Stephen LaPorte) to stat1005 [puppet] - 10https://gerrit.wikimedia.org/r/379851 (https://phabricator.wikimedia.org/T176518) [21:20:03] (03PS6) 10Halfak: Adds myspell-lv package to ores::base Switches myspell-uk to aspell-uk (better package) [puppet] - 10https://gerrit.wikimedia.org/r/377327 (https://phabricator.wikimedia.org/T175628) [21:20:06] Arg! [21:20:09] That should do it [21:20:24] (03CR) 10jerkins-bot: [V: 04-1] Adds myspell-lv package to ores::base Switches myspell-uk to aspell-uk (better package) [puppet] - 10https://gerrit.wikimedia.org/r/377327 (https://phabricator.wikimedia.org/T175628) (owner: 10Halfak) [21:20:52] (03PS4) 10Zoranzoki21: Access for Slaporte (Stephen LaPorte) to stat1005 [puppet] - 10https://gerrit.wikimedia.org/r/379851 (https://phabricator.wikimedia.org/T176518) [21:21:08] (03CR) 10Zoranzoki21: Access for Slaporte (Stephen LaPorte) to stat1005 (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/379851 (https://phabricator.wikimedia.org/T176518) (owner: 10Zoranzoki21) [21:22:29] halfak: you should be able to reproduce locally though by just running "tox" :) [21:25:51] (03CR) 10Hashar: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/379851 (https://phabricator.wikimedia.org/T176518) (owner: 10Zoranzoki21) [21:26:40] (03CR) 10Zoranzoki21: "@Hashar How to I work recheck?" [puppet] - 10https://gerrit.wikimedia.org/r/379851 (https://phabricator.wikimedia.org/T176518) (owner: 10Zoranzoki21) [21:26:55] (03CR) 10Hashar: "Thanks Zoranzoki21 :)" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/379851 (https://phabricator.wikimedia.org/T176518) (owner: 10Zoranzoki21) [21:27:24] * halfak runs tox [21:27:42] halfak: specially tox -e commit-message [21:27:50] that is a python soft that validates.. the commit message! 
:] [21:28:01] hashar: https://www.mediawiki.org/wiki/Commit-message-validator has instructions on how to set it up as a git hook [21:28:07] halfak: ^ [21:28:15] magic [21:28:43] on those good words. I am off for the week-end! Happy hacking everyone :] [21:29:01] (03PS7) 10Halfak: Adds myspell-lv, myspell-uk to aspell-uk to ores::base [puppet] - 10https://gerrit.wikimedia.org/r/377327 (https://phabricator.wikimedia.org/T175628) [21:29:09] o/ hashar [21:29:12] thanks for the help [21:29:15] also legoktm :D [21:30:45] (03CR) 10Zoranzoki21: "Ok @Hashar" [puppet] - 10https://gerrit.wikimedia.org/r/379851 (https://phabricator.wikimedia.org/T176518) (owner: 10Zoranzoki21) [21:31:49] * hashar waves [21:32:05] 10Operations, 10Ops-Access-Requests: Requesting access to stat1005 for Slaporte - https://phabricator.wikimedia.org/T176518#3628562 (10Zoranzoki21) >>! In T176518#3628460, @hashar wrote: > Note the public key is used on labs and IIRC access to production requires a different ssh key. > > @Zoranzoki21 provided... 
[21:35:46] PROBLEM - MariaDB Slave IO: s7 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect [21:35:46] PROBLEM - MariaDB Slave SQL: m3 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect [21:35:46] PROBLEM - MariaDB Slave IO: s5 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect [21:35:46] PROBLEM - MariaDB Slave Lag: s3 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect [21:35:46] PROBLEM - MariaDB Slave SQL: s5 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect [21:35:47] PROBLEM - MariaDB Slave IO: s2 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect [21:35:47] PROBLEM - MariaDB Slave IO: s1 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect [21:35:48] PROBLEM - MariaDB Slave IO: s6 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect [21:35:48] PROBLEM - MariaDB Slave IO: m3 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect [21:35:49] PROBLEM - MariaDB Slave SQL: s1 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect [21:35:49] PROBLEM - MariaDB Slave SQL: s2 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect [21:35:56] PROBLEM - MariaDB Slave SQL: s3 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect [21:35:56] PROBLEM - MariaDB Slave IO: s3 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect [21:35:56] PROBLEM - MariaDB Slave IO: s4 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect [21:35:56] PROBLEM - MariaDB Slave SQL: m2 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect [21:35:56] PROBLEM - MariaDB Slave SQL: x1 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect [21:35:57] PROBLEM - MariaDB Slave SQL: s7 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect [21:35:57] PROBLEM - MariaDB Slave IO: m2 on dbstore1001 is CRITICAL: CRITICAL 
slave_io_state could not connect [21:35:58] PROBLEM - MariaDB Slave Lag: s2 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect [21:36:17] PROBLEM - MariaDB Slave IO: x1 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect [21:36:27] PROBLEM - MariaDB Slave SQL: s4 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect [21:36:27] PROBLEM - MariaDB Slave SQL: s6 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect [21:36:30] (03PS1) 10Greg Grossmeier: admin: Add gjg to contint-admin [puppet] - 10https://gerrit.wikimedia.org/r/379932 [21:38:51] 10Operations, 10Ops-Access-Requests: Requesting access to stat1005 for Slaporte - https://phabricator.wikimedia.org/T176518#3628614 (10Slaporte) >>! In T176518#3628460, @hashar wrote: > Note the public key is used on labs and IIRC access to production requires a different ssh key. Here is a different key: ``... [21:39:46] RECOVERY - MariaDB Slave IO: s7 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [21:39:46] RECOVERY - MariaDB Slave IO: s5 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [21:39:46] RECOVERY - MariaDB Slave SQL: m3 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes [21:39:46] RECOVERY - MariaDB Slave IO: s2 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [21:39:47] RECOVERY - MariaDB Slave IO: s1 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [21:39:47] RECOVERY - MariaDB Slave IO: s6 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [21:39:47] RECOVERY - MariaDB Slave SQL: s5 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes [21:39:48] RECOVERY - MariaDB Slave IO: m3 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [21:39:48] RECOVERY - MariaDB Slave SQL: s1 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes [21:39:49] RECOVERY - MariaDB Slave SQL: s2 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes 
[21:39:56] RECOVERY - MariaDB Slave SQL: s3 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes [21:39:57] RECOVERY - MariaDB Slave IO: s3 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [21:39:57] RECOVERY - MariaDB Slave SQL: m2 on dbstore1001 is OK: OK slave_sql_state not a slave [21:39:57] RECOVERY - MariaDB Slave IO: s4 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [21:39:57] RECOVERY - MariaDB Slave SQL: x1 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional) [21:39:57] RECOVERY - MariaDB Slave SQL: s7 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes [21:39:57] RECOVERY - MariaDB Slave IO: m2 on dbstore1001 is OK: OK slave_io_state not a slave [21:40:17] RECOVERY - MariaDB Slave IO: x1 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [21:40:26] RECOVERY - MariaDB Slave SQL: s4 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes [21:40:26] RECOVERY - MariaDB Slave SQL: s6 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes [21:51:56] (03PS5) 10Zoranzoki21: Access for Slaporte (Stephen LaPorte) to stat1005 [puppet] - 10https://gerrit.wikimedia.org/r/379851 (https://phabricator.wikimedia.org/T176518) [21:52:41] 10Operations, 10Ops-Access-Requests: Requesting access to stat1005 for Slaporte - https://phabricator.wikimedia.org/T176518#3628639 (10Zoranzoki21) >>! In T176518#3628614, @Slaporte wrote: >>>! In T176518#3628460, @hashar wrote: >> Note the public key is used on labs and IIRC access to production requires a di... [21:53:13] (03CR) 10Zoranzoki21: Access for Slaporte (Stephen LaPorte) to stat1005 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/379851 (https://phabricator.wikimedia.org/T176518) (owner: 10Zoranzoki21) [22:08:48] 10Operations, 10Ops-Access-Requests: Request access to logstash (nda group) for @framawiki - https://phabricator.wikimedia.org/T176364#3628679 (10Zoranzoki21) a:03Zoranzoki21 I support this. 
I will made a patch when https://gerrit.wikimedia.org/r/#/c/379851/ and https://gerrit.wikimedia.org/r/#/c/379932/1 be... [22:10:04] (03CR) 10Zoranzoki21: [C: 031] admin: Add gjg to contint-admin [puppet] - 10https://gerrit.wikimedia.org/r/379932 (owner: 10Greg Grossmeier) [22:13:50] 10Operations, 10Ops-Access-Requests, 10WMF-NDA-Requests: Request access to logstash (nda group) for @framawiki - https://phabricator.wikimedia.org/T176364#3628686 (10zhuyifei1999) [22:14:18] 10Operations, 10Ops-Access-Requests, 10WMF-NDA-Requests: Request access to logstash (nda group) for @framawiki - https://phabricator.wikimedia.org/T176364#3622856 (10MarcoAurelio) @Zoranzoki21 This needs to be supported/approved by some people, NDAS, etc. I suggest you un-claim the task so the relevant peopl... [22:14:22] (03PS1) 10Andrew Bogott: labtest: include salt profile on labtestcontrol [puppet] - 10https://gerrit.wikimedia.org/r/379940 [22:14:47] (03CR) 10jerkins-bot: [V: 04-1] labtest: include salt profile on labtestcontrol [puppet] - 10https://gerrit.wikimedia.org/r/379940 (owner: 10Andrew Bogott) [22:15:22] 10Operations, 10Ops-Access-Requests, 10WMF-NDA-Requests: Request access to logstash (nda group) for @framawiki - https://phabricator.wikimedia.org/T176364#3628690 (10Zoranzoki21) a:05Zoranzoki21>03None >>! In T176364#3628688, @MarcoAurelio wrote: > @Zoranzoki21 This needs to be supported/approved by some... [22:15:53] 10Operations, 10Ops-Access-Requests, 10WMF-NDA-Requests: Request access to logstash (nda group) for @framawiki - https://phabricator.wikimedia.org/T176364#3628693 (10Dereckson) To better understand your request, could you give a sample of tasks you would like to create with LogStash access? At what frequenc... 
[22:17:23] (03PS2) 10Andrew Bogott: labtest: include salt profile on labtestcontrol [puppet] - 10https://gerrit.wikimedia.org/r/379940 [22:17:59] 10Operations, 10LDAP-Access-Requests, 10WMF-NDA-Requests: Request access to logstash (nda group) for @framawiki - https://phabricator.wikimedia.org/T176364#3628696 (10zhuyifei1999) [22:18:23] (03CR) 10Andrew Bogott: [C: 032] labtest: include salt profile on labtestcontrol [puppet] - 10https://gerrit.wikimedia.org/r/379940 (owner: 10Andrew Bogott) [22:19:10] (03CR) 10Dzahn: Access for Slaporte (Stephen LaPorte) to stat1005 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/379851 (https://phabricator.wikimedia.org/T176518) (owner: 10Zoranzoki21) [22:20:38] (03CR) 10Zoranzoki21: Access for Slaporte (Stephen LaPorte) to stat1005 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/379851 (https://phabricator.wikimedia.org/T176518) (owner: 10Zoranzoki21) [22:21:23] Dereckson: so does https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170925T1100 look right to you? [22:21:35] (03CR) 10Dzahn: Fix problem with throttle rule for John Michael Kohler Art Center. (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/379661 (https://phabricator.wikimedia.org/T176287) (owner: 10Zoranzoki21) [22:21:47] (03PS6) 10Zoranzoki21: Access for Slaporte (Stephen LaPorte) to stat1005 [puppet] - 10https://gerrit.wikimedia.org/r/379851 (https://phabricator.wikimedia.org/T176518) [22:22:39] (03CR) 10Dzahn: [C: 031] admin: Add gjg to contint-admin [puppet] - 10https://gerrit.wikimedia.org/r/379932 (owner: 10Greg Grossmeier) [22:23:01] tabbycat: looking [22:23:23] tabbycat: Why are you backporting https://gerrit.wikimedia.org/r/#/c/356362/? 
[22:23:43] (03CR) 10Zoranzoki21: "@Dzahn Email address changed on @wikimedia.org domain, per email from new generated ssh key" [puppet] - 10https://gerrit.wikimedia.org/r/379851 (https://phabricator.wikimedia.org/T176518) (owner: 10Zoranzoki21) [22:23:52] Niharika: it was requested in a task... wasn't it needed? [22:24:12] tabbycat: It's been long deployed with the train. [22:24:32] tabbycat: See https://github.com/wikimedia/mediawiki-extensions-BlockAndNuke/blob/master/BanPests.php [22:24:47] Niharika: sorry, I'm so noob... Does that mean that no backport to other REL* are needed? [22:25:50] tabbycat: Yep, no backports are needed. Anything that gets +2ed and merged goes out with the train unless somebody wants it out sooner. In that case cherry-picking and SWAT are required. [22:26:01] The train runs every week. [22:26:09] * tabbycat is confused [22:26:21] so... why are RELs for? [22:26:30] backports to REL_ branches don't get pushed to production, we only use those for tarballs [22:26:34] ^ [22:26:34] releases [22:26:44] block and nuke ain't deployed to the wikimedia cluster though [22:27:18] if they are needed for a new tarball release of those REL_ branches, then sure [22:27:36] okay, let me get this... so once the change in master was merged... it also gots added to the other REL* ? [22:27:47] see T173687 [22:27:48] T173687: Block and Nuke broken in REL1_27 branch due to whitelist truncation - please backport patch from master - https://phabricator.wikimedia.org/T173687 [22:28:09] tabbycat: no, I think there was confusion from others at the beginning [22:28:28] Ah, I didn't know about the tarball being different from production. [22:28:30] tl;dr: you did it right, if this is something that needs to be added to and create a new release for those past releases [22:28:48] My bad. Sorry for the confusion tabbycat. 
[22:28:55] :) [22:29:37] greg-g: I think that's what they requested, because the extension is 'broken' from REL1_27 onwards [22:29:41] !log gerrit2001 - systemd says gerrit.service is failed. gerrit.sh start says "Already Running!!" :p - cobalt is fine [22:29:43] Niharika: no problem! :) [22:29:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:30:06] I still have a lot to learn in this new world of the development [22:31:27] * tabbycat still remembers the first time he read 'cherry-pick' and was like... wtf have cherries to do with software o_O [22:32:03] :D [22:32:24] let's hope they don't create a 'potato-pick' or something else [22:32:38] or a 'goat-pick'... goats are in fashion on Wikimedia lately [22:33:50] Pray no German hears that... [22:34:18] !log gerrit2001 - stopping gerrit with gerrit.sh stop, letting puppet start it again [22:34:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:35:57] PROBLEM - MariaDB Slave Lag: s3 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect [22:36:16] PROBLEM - MariaDB Slave Lag: s2 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect [22:37:57] PROBLEM - MariaDB Slave IO: s5 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect [22:37:57] PROBLEM - MariaDB Slave IO: s7 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect [22:37:57] PROBLEM - MariaDB Slave IO: s6 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect [22:37:57] PROBLEM - MariaDB Slave IO: s1 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect [22:37:57] PROBLEM - MariaDB Slave SQL: s5 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect [22:37:57] PROBLEM - MariaDB Slave SQL: m3 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect [22:37:57] PROBLEM - MariaDB Slave IO: s2 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect [22:38:06] PROBLEM - 
MariaDB Slave IO: m3 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect [22:38:07] PROBLEM - MariaDB Slave SQL: s2 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect [22:38:07] PROBLEM - MariaDB Slave SQL: s1 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect [22:38:16] PROBLEM - MariaDB Slave IO: s3 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect [22:38:16] PROBLEM - MariaDB Slave SQL: x1 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect [22:38:16] PROBLEM - MariaDB Slave SQL: m2 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect [22:38:16] PROBLEM - MariaDB Slave SQL: s7 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect [22:38:16] PROBLEM - MariaDB Slave SQL: s3 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect [22:38:17] PROBLEM - MariaDB Slave IO: m2 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect [22:38:17] PROBLEM - MariaDB Slave IO: s4 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect [22:38:36] PROBLEM - MariaDB Slave IO: x1 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect [22:38:37] PROBLEM - MariaDB Slave SQL: s4 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect [22:38:37] PROBLEM - MariaDB Slave SQL: s6 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect [22:43:06] RECOVERY - Work requests waiting in Zuul Gearman server https://grafana.wikimedia.org/dashboard/db/zuul-gearman on contint1001 is OK: OK: Less than 30.00% above the threshold [90.0] [22:43:11] (03PS6) 10Andrew Bogott: nova: turn off hourly instance usage audits [puppet] - 10https://gerrit.wikimedia.org/r/377187 [22:43:35] (03CR) 10Andrew Bogott: [C: 031] "I don't see this causing any harm on labtest. I'll merge in prod when I'm going to be around for a while to watch." 
[puppet] - 10https://gerrit.wikimedia.org/r/377187 (owner: 10Andrew Bogott) [22:45:06] PROBLEM - MariaDB Slave Lag: s6 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect [22:45:06] PROBLEM - MariaDB Slave Lag: s1 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect [22:45:36] PROBLEM - MariaDB Slave Lag: x1 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect [22:45:37] PROBLEM - MariaDB Slave Lag: s7 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect [22:45:37] PROBLEM - MariaDB Slave Lag: m2 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect [22:45:37] PROBLEM - MariaDB Slave Lag: m3 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect [22:45:46] PROBLEM - MariaDB Slave Lag: s4 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect [22:48:05] (03PS7) 10Krinkle: webperf: Fix crash when event contains browser_major:null [puppet] - 10https://gerrit.wikimedia.org/r/379820 (https://phabricator.wikimedia.org/T176149) [22:52:07] no_justification hi, do you know why we comment this [22:52:08] # '-Dlog4j.configuration=file:///var/lib/gerrit2/review_site/etc/log4j.properties', [22:52:09] out? [22:52:19] please [22:52:50] Uh cuz it wasn't working. Did it not get uncommented when we fixed the settings? [22:53:19] nope [22:53:20] was wondering why it was commented out. 
[22:53:21] thanks for explaining [22:53:29] That'd explain why fixed settings didn't do anything [22:53:33] (03Draft1) 10Paladox: Gerrit: Remove gc logging [puppet] - 10https://gerrit.wikimedia.org/r/379946 [22:53:36] (03PS2) 10Paladox: Gerrit: Remove gc logging [puppet] - 10https://gerrit.wikimedia.org/r/379946 [22:54:08] (03CR) 10Chad: "Let's comment them out instead so we have them for reference again if need be" [puppet] - 10https://gerrit.wikimedia.org/r/379946 (owner: 10Paladox) [22:54:17] (03CR) 10Chad: "Maybe with a comment what they're for" [puppet] - 10https://gerrit.wikimedia.org/r/379946 (owner: 10Paladox) [22:55:13] (03PS3) 10Paladox: Gerrit: Remove gc logging [puppet] - 10https://gerrit.wikimedia.org/r/379946 [22:56:31] !log gerrit2001 - trying to manually start gerrit again, as opposed to puppet doing it.. debugging gerrit-ssh issue there, cobalt still untouched [22:56:33] (03CR) 10Chad: [C: 031] Gerrit: Remove gc logging [puppet] - 10https://gerrit.wikimedia.org/r/379946 (owner: 10Paladox) [22:56:42] thanks :) [22:56:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:57:10] it also fails to start when i just manually use gerrit.sh start [22:57:15] hmm [22:57:17] try [22:57:19] so should not even be systemd related [22:57:20] bin/gerrit.sh run [22:57:58] that should at least give us something since it doesn't seem to be writing to the logs [22:58:00] what's the difference between running and starting [22:58:09] running shows everything [22:58:16] Starting Gerrit Code Review: FAILED [22:58:22] Running Gerrit Code Review: [22:58:27] Already Running!! 
[22:58:37] it shows you what the log should show [22:58:42] hmm [22:58:46] try bin/gerrit.sh stop [22:58:49] bin/gerrit.sh run [23:00:11] (03PS1) 10Krinkle: Enable jQuery 3 on Wiktionary sites [mediawiki-config] - 10https://gerrit.wikimedia.org/r/379947 (https://phabricator.wikimedia.org/T124742) [23:00:13] (03PS1) 10Krinkle: Enable jQuery 3 on most group1 wikis (non-Wikipedia) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/379948 (https://phabricator.wikimedia.org/T124742) [23:00:15] (03PS1) 10Krinkle: Enable jQuery 3 on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/379949 [23:02:12] mutante did bin/gerrit.sh run show anything? [23:02:53] paladox: yes, what i pasted, it shouts at me "Already Running!!" [23:02:59] ok [23:03:06] ok, one sec [23:03:07] mutante try stopping and then run run [23:03:09] ok [23:03:10] thanks [23:04:15] doing , stop/run [23:04:24] thanks [23:04:29] yes, it calls it "Running" as opposed to "starting" [23:04:37] (03PS2) 10Krinkle: Enable jQuery 3 on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/379949 (https://phabricator.wikimedia.org/T124742) [23:05:23] 10Operations, 10Ops-Access-Requests: Requesting access to production bastions for cwdent - https://phabricator.wikimedia.org/T176529#3628771 (10cwdent) [23:05:33] yep [23:05:36] waits [23:07:03] paladox: it's like it runs but still doesnt start the ssh daemon [23:07:16] no output where i typed "run" but also no failure and doesnt quit [23:07:25] does it show anything in log format [23:07:28] in the screen [23:07:29] ? [23:07:48] no, just "Running Gerrit Code Review: " and sits there [23:08:06] tries something else [23:08:23] hmm [23:08:49] unlinks plugin dir [23:08:52] gerrit.sh start [23:09:38] i guess we found a problem. Some how it's failing to allow gerrit to start on gerrit2001 [23:10:00] well yea :) [23:10:05] lets do a reboot to see if that will clear anything that is blocking it. 
[23:10:12] and it all started since i restarted the service once [23:10:27] ah [23:10:30] lets try a reboot [23:10:44] it could be a proc [23:11:08] i see the gerrit process running and not running .. when i do start/stop [23:11:23] there are 2 types of behaviour: [23:11:30] a) gerrit service itself fails to start [23:11:38] b) gerrit service runs [23:11:44] but in both cases.. gerrit-ssh never runs anymore [23:11:49] hmm [23:11:49] and it used to before [23:11:54] lets try a reboot. [23:12:18] doesnt want to go the windows route just yet... [23:12:30] ok [23:13:04] aha [23:13:10] bin/gerrit.sh start [23:13:15] then bin/gerrit.sh supervise [23:13:31] gerrit2@gerrit-test3:~/review_site$ bin/gerrit.sh supervise [23:13:31] [2017-09-22 23:13:05,211] [main] INFO com.google.gerrit.server.cache.h2.H2CacheFactory : Enabling disk cache /var/lib/gerrit2/review_site/cache [23:13:32] [2017-09-22 23:13:05,746] [main] WARN com.google.gerrit.server.config.AdministrateServerGroupsProvider : Group "ldap/ops" not available, skipping. [23:13:32] [2017-09-22 23:13:06,783] [main] INFO com.google.gerrit.server.config.ScheduleConfig : gc schedule parameter "gc.interval" is not configured [23:13:32] [2017-09-22 23:13:06,783] [main] INFO com.google.gerrit.server.config.ScheduleConfig : changeCleanup schedule parameter "changeCleanup.startTime" is not configured [23:13:36] but "start" said FAILED and exited again.. 
after some time [23:13:47] tries [23:14:44] yea, always that nice combo where it first says FAILED but then "Already Running" too [23:15:07] hmm [23:15:09] it's confused and that is when i'm not even using puppet or unit file, just the gerrit.sh [23:15:14] wth [23:15:53] GERRIT_STARTUP_TIMEOUT = 90 [23:16:02] 90 seconds [23:16:04] i think that's when i get the FAILED [23:16:08] yeh [23:16:28] if it goes past that (even though it's configurable) then it likely indicates a problem [23:16:37] it depends on the system though [23:16:47] i just repeated "start" and this time it doesnt think it's already running, it tries it again [23:17:19] ok [23:17:45] "supervise" shows nothing so far [23:17:51] hmm [23:18:26] during all this.. not a single line in error_log [23:18:34] still stopped on 19th [23:18:52] am i really doing this as gerrit2 user? [23:19:03] it can definitely write into that log [23:19:13] hmm [23:19:16] if it can write [23:19:29] then something is blocking gerrit from doing it [23:19:34] i tested writing to it as gerrit2, can [23:19:58] ok [23:20:03] so gerrit is having the problem [23:20:17] but if it works on cobalt, it has to be something other than a gerrit config [23:20:41] will not touch cobalt, it's possible it would break there too if we do [23:21:18] and it just wasnt restarted since X [23:21:40] much better to know first what is the issue on the non-active one [23:21:51] but could also be that it's all just because this one uses --slave [23:22:00] and without it it wouldnt have the problem [23:22:01] i used --slave [23:22:03] and works for me [23:22:06] ok.. [23:22:07] writes to logs too [23:24:12] oh [23:24:21] you know what.. apt-get upgrade would upgrade gerrit here [23:25:14] yep [23:25:17] ah [23:25:25] is there an update? [23:25:34] wait, i dont get it. it has 2.13.8+git1-wmf.6 on both of them [23:26:26] apt-get update [23:26:26] it would install 2.13.8+git1-wmf.7 [23:26:29] apt-get upgrade [23:26:38] but still.. it's not different.. 
both just on .6 so far [23:27:06] hmm [23:29:26] yea, i built that package, it's just not installed yet, on either [23:29:39] hmm [23:30:29] .7 drops the plugins [23:30:34] yep [23:36:15] !log gerrit2001 - rebooting .. wave a dead chicken http://www.catb.org/jargon/html/W/wave-a-dead-chicken.html [23:36:25] lol [23:36:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:40:47] PROBLEM - puppet last run on labtestweb2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/lib/nagios/plugins/check_sysctl] [23:43:20] mutante did a reboot work? :) [23:43:24] paladox: ..no [23:43:29] oh [23:46:43] mutante does this /etc/default/gerritcodereview file exist? [23:47:09] yes, it does [23:47:16] 1 GERRIT_SITE="/var/lib/gerrit2/review_site" [23:47:21] 2 GERRIT_WAR="/var/lib/gerrit2/review_site/bin/gerrit.war" [23:47:30] ok [23:48:03] lets try bin/gerrit.sh stop && chmod -R gerrit2:gerrit2 /var/lib/gerrit2/review_site/ && bin/gerrit.sh start [23:48:16] https://www.google.co.uk/search?client=safari&channel=ipad_bm&dcr=0&source=hp&q=gerrit+not+starting&oq=gerrit+not+starting&gs_l=psy-ab.3..0.426.3519.0.3716.21.19.0.0.0.0.91.1026.19.19.0....0...1.1.64.psy-ab..2.19.1024.0..35i39k1j0i131k1j0i22i30k1j0i22i10i30k1.0.unB21B3UVgs [23:51:34] (03PS1) 10Krinkle: mediawiki/hhvm: Move fatal-error.php to Puppet [puppet] - 10https://gerrit.wikimedia.org/r/379953 (https://phabricator.wikimedia.org/T113114) [23:51:54] chown, not chmod.. it doesnt fix it [23:52:04] ok [23:52:33] (03CR) 10Krinkle: "Looking for feedback on puppet logic (correct Require? Okay to require that from here? Correct Before?)." [puppet] - 10https://gerrit.wikimedia.org/r/379953 (https://phabricator.wikimedia.org/T113114) (owner: 10Krinkle) [23:53:08] suspects plugins dir and tries unlinking that again [23:53:49] ok [23:54:18] none of that makes a difference.. 
[23:55:10] does "bash -x" on the old init.d script (all methods fail) [23:55:13] ok [23:55:17] ah yep [23:55:19] bash -x [23:55:29] (03PS2) 10Krinkle: mediawiki/hhvm: Move fatal-error.php to Puppet [puppet] - 10https://gerrit.wikimedia.org/r/379953 (https://phabricator.wikimedia.org/T113114) [23:55:37] so it just sits there and keeps checking what is inside the gerrit.run file [23:56:13] ok [23:56:19] and sleeps and tries again.. until the timeout is reached [23:56:29] yeh [23:57:23] gerrit does get a pid [23:57:43] yep [23:57:45] gerrit.pid [23:58:45] i am guessing it's time for running the java command manually [23:59:18] java -jar /var/lib/gerrit2/review_site/bin/gerrit.war daemon -d /var/lib/gerrit2/review_site [23:59:19] mutante ^^ [23:59:56] java -jar /var/lib/gerrit2/review_site/bin/gerrit.war daemon -d /var/lib/gerrit2/review_site --show-stack-trace